(54) Integrated retrieval scheme for retrieving semi-structured documents



(57) An integrated retrieval scheme retrieves data
contained in a plurality of semi-structured documents
scattered over open networks and collects the required
information item by item from the semi-structured
documents through a unified interface without regard to
differences in the document structures, presentation
styles, and elements of the semi-structured documents.

The search scheme receives a query consisting of
search items and search conditions from a user (S200).
The search scheme finds, according to location data
that specifies the location of each of the semi-structured
documents, the location of each semi-structured
document that contains all search items (S210);
converts, if necessary, item presentation styles of the
entered query into those of the located semi-structured
documents according to style conversion data
(S220, S225, S230); forms queries for the located
semi-structured documents; transmits the queries to the
found locations and obtains the located semi-structured
documents (S240); extracts item data from the obtained
semi-structured documents according to structure data
used to delimit each document into items and attribute
data used for conditional retrieval, and prepares a
search result (S240); and converts, if necessary, item
presentation styles of the search result into the item
presentation styles of each user according to the style
conversion data (S250).
Description

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

[0001] The present invention relates to a retrieval
technique applied to an open network environment that
involves a plurality of semi-structured documents and
search engines. In particular, the present invention
relates to an integrated retrieval scheme that manages
location data, document structure data, item data,
presentation style data, etc., to provide a unified
interface for retrieving required information item by item
from a plurality of semi-structured documents irrespective
of differences among the locations, document structures,
elements, and input forms of search engines.

2. Description of the Prior Art 

[0002] Increasing performance and decreasing cost of
personal computers, improvements in network technology,
and the growth of inexpensive network providers are
vitalizing open networks, in particular the Internet.
Many information providers employ HTML (hypertext
markup language), a description language for hypertext
that makes content creation easy, to transmit various
information to users through the open networks.
The number of information providers is increasing due
to an explosive increase in information consumers. This
results in the accumulation of various kinds of information
in the networks, and it is required to efficiently provide
each consumer with necessary information from among
the accumulated pieces of information.
[0003] Consumers want to retrieve desired information
comprehensively from across information sources. This
is hardly possible because information accumulated in
the open networks is mostly in HTML documents that
have mutually different structures, presentation styles,
or search formats, which hinders retrieving desired
information from across different information sources.
[0004] Information retrieval apparatuses, so-called
search engines, are widely used for retrieving HTML
documents scattered over the network. Here, "search
engine" is a generic term for a system that retrieves
certain information through an input form. Figure 1 shows
an information retrieval technique according to a prior
art using a URL search engine. The URL search engine is
a search engine that returns URLs as a search result in
response to a query with keywords or conditional terms.
For example, a user has an interest in "a PC of 100,000
yen or below." The user enters keywords into a URL search
engine. Figure 2 shows an example of a URL search
engine according to a prior art. The URL search engine
900 has a keyword index 910 that contains keywords
and locations, i.e., URLs, related to HTML documents
spread over networks; the keyword index 910 is
registered in advance. A search processor 930 searches
the keyword index 910 for the keywords entered by the
user and returns a list of URLs and outlines; each URL
indicates the location of an HTML document that contains
the entered keywords or their synonyms. Returning to
Fig. 1, the user accesses the returned HTML documents one
by one to find the necessary information. In this way,
when obtaining information from HTML documents whose
locations are unknown, the users first had to find the
locations of HTML documents that may contain necessary
information by a wide document search, and then inspect
each of the HTML documents in the obtained URL list for
the necessary information, so that it takes a long time
and much labor to obtain the necessary information. In
addition, the prior arts are incapable of collectively
retrieving information from across a plurality of HTML
documents.
[0005] The prior arts may find the locations of
HTML documents that contain given keywords and the
synonyms thereof but are unable to collect information
item by item by collectively retrieving information
contained in HTML documents. The prior arts are unable
to set conditions on search results. For example, they
are unable to filter search results by date. And, when
using URL search engines that each provide a search
interface as an input form, users must take into account
the individual form input interface of each URL search
engine and access each URL search engine one by one.

[0006] More particularly, HTML documents employed
in on-line shops of electronic commerce frequently
show product information such as names and prices
with a list description in table or clause style, in which
each entry forms one meaningful cluster of data. There
are demands to retrieve information collectively from
among these HTML documents of on-line shops. For
example, a user may want to retrieve information about
shops that offer the lowest price for a specific product.
In this case, the user enters the name, maker, category,
etc., of the product as keywords. Then, the prior art of
Fig. 1 provides the user with the locations of HTML
documents related to the keywords. The user accesses the
HTML documents one by one to see if they offer the
product under preferable conditions. The prior art of
Fig. 1, however, searches the full text of each HTML
document for the entered keywords without considering
the elements that form the HTML document, and therefore
tends to retrieve a lot of data irrelevant to the user.
Accordingly, the user must spend much time and labor to
find the necessary information from among the HTML
documents retrieved by the prior art.
[0007] The prior arts are incapable of retrieving
information from a given HTML document item by item. For
example, they are unable to extract the price, image,
maker, etc., of a given product from a given HTML
document containing a product information table. The prior
arts are unable to extract the name, phone number,
address, etc., of each shop from a given HTML document
containing shop information in clause style. The prior
arts are unable to set conditions such as date to filter
results retrieved from HTML documents.
[0008] There is a conventional technique that creates
a hypothetical database by mapping the internal
structure of each document and relationships between
documents into unique models, to extract itemized pieces
of information. This technique was disclosed by N. Ashish
and C. A. Knoblock in "Semi-automatic wrapper
generation for internet information sources," Proceedings
of Cooperative Information Systems, 1997. This technique
regards as meaningful information a portion of an HTML
document that has specific tags, such as a TITLE tag, or
specific styling such as size, color, or typestyle (e.g.,
bold and italic), and extracts this information
automatically. This technique covers the case where a
minimum cluster of certain information is described in
one HTML document and a plurality of such HTML documents
are described in the same format. This technique is, for
example, effective when regionalized weather information
is described in different HTML documents. However, this
technique does not take into account the case where
information is described as a list description, such as a
table or clauses, in one HTML document. Accordingly, this
technique cannot be applied to that case.
[0009] J. Hammer, H. Garcia-Molina, J. Cho, R. Aranha,
and A. Crespo disclosed another technique in "Extracting
semistructured information from the web," Workshop
on Management of Semistructured Data, 1997.
This technique creates a hypothetical database by
employing a unique OEM data model and manages the
relationships between the database and various
information sources, thereby retrieving information from
heterogeneous web sources in an integrated manner. This
technique employs a template file that depends on the
HTML tag description rules of each HTML document to
manage the above relationships. However, in this
technique, a modification in an HTML document affects the
hypothetical database, and a modification in the
hypothetical database in turn affects applications.
Accordingly, this technique needs much labor for
management and maintenance of the system.
[0010] There are no standards for the HTML descriptions
used to provide information such as products handled
by on-line shops. Namely, on-line shops use individually
designed HTML documents, as explained below.
[0011] HTML documents prepared by on-line shops
have different document structures. For example, a
shop A employs a TABLE tag to describe products in
table format, while a shop B employs a UL tag to itemize
products in clause format.
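
As an illustration only, the following sketch reads product records
from both kinds of page. It assumes that shop A places each field in
its own TD cell and that shop B separates fields with commas inside
each LI element; the class name `RowCollector`, the helper `extract`,
and the sample markup are hypothetical and do not come from the
specification.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect one record per <tr> (table format) or <li> (clause format)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished records
        self._row = None   # record currently being filled
        self._cell = None  # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if tag in ("tr", "li"):
            self._row = []
        if tag in ("td", "li"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "li") and self._cell is not None:
            text = "".join(self._cell).strip()
            self._cell = None
            if tag == "li":
                # Assumed clause layout: fields separated by commas.
                self._row = [part.strip() for part in text.split(",")]
            elif self._row is not None:
                self._row.append(text)
        if tag in ("tr", "li") and self._row:
            self.rows.append(self._row)
            self._row = None


def extract(html):
    """Return the list of records found in one HTML document."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    return parser.rows


shop_a = "<TABLE><TR><TD>PC-100</TD><TD>98,000 yen</TD></TR></TABLE>"
shop_b = "<UL><LI>PC-100, 98 thousand yen</LI></UL>"
print(extract(shop_a))
print(extract(shop_b))
```

Both pages yield a two-field record for the same product, although
the price strings still differ in presentation style.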
[0012] The HTML documents of on-line shops employ
different presentation styles even for the same product.
For example, yen, thousand yen, ten-thousand yen,
dollars, etc., are used as price units depending on shops.
Some shops use double-byte characters to express
prices and others employ single-byte characters for the
same purpose.
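
A minimal sketch of such a style conversion is given here, assuming
prices are written as a number followed by one of the yen-based units
named above. The function name `normalize_price` and the table
`UNIT_MULTIPLIERS` are hypothetical. Dollar amounts are left out
because they would need an exchange rate that the text does not give.

```python
import re
import unicodedata
from decimal import Decimal

# Hypothetical multipliers for the yen-based units named in the text.
UNIT_MULTIPLIERS = {
    "ten-thousand yen": 10_000,
    "thousand yen": 1_000,
    "yen": 1,
}


def normalize_price(text):
    """Convert a shop-specific price string into an integer yen amount."""
    # NFKC folds full-width (double-byte) digits and commas to ASCII.
    ascii_text = unicodedata.normalize("NFKC", text).strip().lower()
    match = re.fullmatch(r"([\d,.]+)\s*(.+)", ascii_text)
    if match is None:
        raise ValueError(f"unrecognized price: {text!r}")
    number, unit = match.groups()
    if unit not in UNIT_MULTIPLIERS:
        raise ValueError(f"unknown unit: {unit!r}")
    # Decimal avoids binary rounding for values such as 9.8.
    return int(Decimal(number.replace(",", "")) * UNIT_MULTIPLIERS[unit])


for sample in ("９８，０００ yen", "98 thousand yen", "9.8 ten-thousand yen"):
    print(sample, "->", normalize_price(sample))
```

Once every price is reduced to one unit and one character width,
prices from different shops can be compared directly.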

[0013] The HTML documents of on-line shops have
different data elements even for the same product. For
example, a product is represented with only the name
thereof, or the name and model number thereof, or the
maker, name, and model number thereof depending on
shops. To get necessary information from HTML
documents gathered by the conventional retrieval
techniques, users must extract pieces of information from
the documents and compare them with one another. It
takes a long time and much labor to retrieve necessary
data from them.

[0014] In addition, the search engines used to search
open networks for required information differ from one
another in the types of information handled, performance,
and fees, and therefore users of plural search engines
must choose among them depending on the situation. For
this purpose, the users must know the location and
interface peculiar to each search engine.

[0015] First, it is difficult to find and manage the
locations of search engines. The users must individually
manage the locations of search engines with the use of,
for example, bookmarks. This is hard to achieve in an
environment where terminals other than the user's own
are used, such as a mobile environment.

[0016] Second, the search interfaces that search
engines provide as input forms are not unified. Many
search engines employ their own input forms, whose
structures are not unified. Accordingly, the users must
learn separate systems, operation sequences, and
schemes when handling different search engines. It
is hard for the users to know which search engine is
effective for a certain search item. It is also hard for the
users to conditionally process information contained in
retrieved HTML documents.

[0017] Third, searching for information through search
engines is inefficient. The users must handle several
search engines until they get the required information,
which involves many search operations.
[0018] Fourth, the search engines return search results
that differ in item presentation styles, character
codes, etc., and it is difficult for the users to compare
the search results with one another.

[0019] To solve the heterogeneity among the search
engines, Junrton World Seek at
http://member.nifty.ne.jp/iumon has disclosed a technique
of preparing a common search interface for URL search
engines, which are one kind of search engine, managing
relationships between the common search interface
and the individual interfaces of the URL search engines,
converting a search request for the common search
interface into search requests for the search engines, and
executing the search requests for the search engines.
This technique provides the common search interface
employing a single text box to handle the URL search
engines. In practice, there are not only the URL search
engines but also other various search engines. To use
such a variety of search engines, this technique has the
following problems:

(1) Necessity of considering a plurality of input items

[0020] Some search engines employ the simplest input
form, with a single text box for entering keywords to
search. To narrow the information to retrieve, some search
engines allow the users to enter search conditions such
as an area and an industry field in addition to keywords.
However, the technique mentioned above is incapable
of achieving such a narrowing search operation
because it does not support a plurality of input items.

(2) Necessity of coping with a variety of input forms

[0021] To properly enter search conditions, some
search engines employ several input form objects, such
as text boxes for text input, radio buttons for selecting
one among several items, and select boxes or check
boxes for selecting some among several items. The
technique mentioned above is incapable of coping with
these data entry objects except for the text box because
it supports only a single text box.

(3) Reconstruction of application

[0022] When adding, correcting, or deleting search
engines with respect to the common search interface,
the technique mentioned above must correct the common
search interface and reconstruct the corresponding
applications.

[0023] In this way, the conventional technique
mentioned above is incapable of coping with a variety of
search engines and needs a lot of time and labor to
design, maintain, and manage.

SUMMARY OF THE INVENTION 

[0024] An object of the present invention is to provide
an integrated retrieval scheme capable of retrieving
required information from a plurality of semi-structured
documents such as HTML documents that are scattered
over open networks and have different document
structures, presentation styles, and information
elements, converting the retrieved information into a
unified form for each user, and returning the information
in the unified form to the user.

[0025] Another object of the present invention is to
provide an integrated retrieval scheme capable of
individually managing input form objects of each search
engine serving open networks to resolve differences
among the search engines, generating search requests
specific to the search engines according to a user's
search request, and executing search operations with
respect to the search engines in an open network
environment including many search engines.
[0026] Still another object of the present invention is
to provide an integrated retrieval scheme capable of
managing the location, document structure, and item
attributes of each HTML document and extracting
required information item by item from arbitrary HTML
documents that differ in location, document structure,
and attributes.

[0027] In order to accomplish the objects, an aspect of
the present invention provides an apparatus for retrieving
data contained in a plurality of semi-structured
documents over open networks, comprising: a unit for
storing meta data for each of the semi-structured
documents, the meta data including items to be extracted
from the semi-structured documents and item data
used to conditionally retrieve the items; a unit for
retrieving data scattered among the semi-structured
documents for an entered query according to the meta
data, and preparing a collective search result; and a
unit for outputting the search result in a prescribed
single format that is specific to each user.
[0028] Another aspect of the present invention
provides an apparatus for retrieving data contained in a
plurality of semi-structured documents over open
networks, comprising: (a) a unit for storing location data
about the location of each of the semi-structured
documents, document structure data about the structure of
each of the semi-structured documents, used to delimit
each document into items to be extracted, attribute data
about the attributes of each of the items to be extracted,
used to conditionally retrieve the items, and style
conversion data used to convert item presentation styles
of the user and item presentation styles of the
semi-structured documents from one into another; (b) a
unit for finding, according to the location data, the
location of a semi-structured document that contains all
search items specified in an entered query that consists
of the search items and search conditions; (c) a unit for
converting, if necessary, item presentation styles of the
entered query into item presentation styles of the
search items in the located semi-structured documents
according to the style conversion data, and forming
queries for the located semi-structured documents; (d) a
unit for transmitting the queries provided by the unit (c)
to the found locations and acquiring the semi-structured
documents; (e) a unit for extracting item data from the
acquired semi-structured documents according to the
document structure data, selecting the extracted item
data, if necessary, according to the attribute data for
the search condition, and preparing a search result; and
(f) a unit for converting, if necessary, item presentation
styles of the search result into the item presentation
styles of each user according to the style conversion
data.
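
The flow through units (b) to (f) can be pictured with the following
rough sketch. It is not the claimed apparatus: the record type
`SourceMeta`, the function names, and the `fetch` callback (which
stands in for query transmission and item extraction) are all
hypothetical, and the only search condition modelled is an upper
price limit in yen.

```python
from dataclasses import dataclass


@dataclass
class SourceMeta:
    """Hypothetical metadata kept by unit (a) for one document."""
    location: str    # location data (for example a URL)
    items: set       # items the document can supply
    price_unit: int  # style conversion data: yen per displayed unit


def find_sources(sources, search_items):
    """Unit (b): keep documents that contain all search items."""
    wanted = set(search_items)
    return [s for s in sources if wanted <= s.items]


def select_and_convert(records, source, max_price_yen):
    """Units (e) and (f): apply the condition, convert to the user's style."""
    result = []
    for rec in records:
        price_yen = rec["price"] * source.price_unit
        if price_yen <= max_price_yen:
            result.append({"name": rec["name"], "price": price_yen,
                           "shop": source.location})
    return result


def integrated_search(sources, fetch, search_items, max_price_yen):
    """Run the whole flow and return one collective, sorted result."""
    results = []
    for source in find_sources(sources, search_items):
        records = fetch(source)  # units (c) and (d), stubbed by the caller
        results.extend(select_and_convert(records, source, max_price_yen))
    return sorted(results, key=lambda r: r["price"])


# Illustrative data only.
sources = [
    SourceMeta("shop-a", {"name", "price"}, 1),
    SourceMeta("shop-b", {"name", "price"}, 1000),
    SourceMeta("shop-c", {"name"}, 1),
]
pages = {
    "shop-a": [{"name": "PC-100", "price": 98000},
               {"name": "PC-300", "price": 150000}],
    "shop-b": [{"name": "PC-100", "price": 95}],
}
hits = integrated_search(sources, lambda s: pages[s.location],
                         ["name", "price"], 100_000)
for hit in hits:
    print(hit)
```

With the sample data, the query "a PC of 100,000 yen or below"
returns the shop-b offer before the shop-a offer, both expressed in
yen, while shop-c is skipped because it cannot supply a price.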

[0029] Still another aspect of the present invention
provides an apparatus for retrieving data through
search engines over open networks, comprising: (aa) a
unit for storing location data about the location of each
search engine, essential input item data specifying
essential input items required by an input form of each
search engine, document structure data about the
structure of each HTML document, used to delimit each
document into items to be extracted, attribute data about
the attributes of the items to be extracted, used to
conditionally retrieve the items, and style conversion data
used to convert item presentation styles of a user and
item presentation styles of each HTML document from
one into another; (bb) a unit for finding, according to the
location data, the location of a search engine that
contains all search items specified in an entered query that
consists of the search items and search conditions; (cc)
a unit for selecting, according to the essential input item
data, the search engines to be searched from among the
located search engines, namely those whose essential
input items satisfy the specified search conditions; (dd)
a unit for determining an optimum retrieval pattern for
each of the selected search engines according to a matrix
table and converting the entered query into queries for
the selected search engines accordingly, the matrix table
defining combinations between the search items and
search conditions and the items and essential input items
of each search engine; (ee) a unit for converting, if
necessary, item presentation styles of the queries
provided by the unit (dd) into item presentation styles of
the search items in the selected search engines according
to the style conversion data; (ff) a unit for transmitting
the queries provided by the unit (ee) to the found
locations and acquiring HTML documents; (gg) a unit for
extracting item data from the acquired HTML documents
serving as a first search result according to the
structure data, selecting the extracted item data, if
necessary, according to the attribute data for the search
condition on the basis of the corresponding retrieval
pattern, and preparing a second search result; and (hh) a
unit for converting, if necessary, item presentation
styles of the second search result into item presentation
styles of each user according to the style conversion
data.
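
The selection performed by units (bb) and (cc) might look like the
sketch below. It rests on one reading of the text, namely that an
engine qualifies when it can return every requested item and every
essential input item of its form can be filled from the entered
search conditions. The matrix table of unit (dd) is not modelled. The
engine records, field names, and example.invalid URLs are
hypothetical.

```python
# Hypothetical engine records; the field names are illustrative only.
ENGINES = [
    {"location": "http://engine-a.example.invalid/search",
     "items": {"name", "price", "shop"},
     "essential_inputs": {"keyword"}},
    {"location": "http://engine-b.example.invalid/search",
     "items": {"name", "price"},
     "essential_inputs": {"keyword", "area"}},
    {"location": "http://engine-c.example.invalid/search",
     "items": {"name"},
     "essential_inputs": {"keyword"}},
]


def select_engines(engines, search_items, condition_items):
    """Return locations of engines usable for the entered query."""
    wanted = set(search_items)
    supplied = set(condition_items)
    selected = []
    for engine in engines:
        if not wanted <= engine["items"]:
            continue  # (bb): engine cannot return every requested item
        if not engine["essential_inputs"] <= supplied:
            continue  # (cc): a required form field would stay empty
        selected.append(engine["location"])
    return selected


print(select_engines(ENGINES, ["name", "price"], ["keyword"]))
print(select_engines(ENGINES, ["name", "price"], ["keyword", "area"]))
```

Supplying an additional condition such as an area widens the set of
usable engines, because forms that require that field can then be
completed.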
[0030] Still another aspect of the present invention
provides an apparatus for extracting data item by item
from an arbitrary HTML document over open networks,
comprising: (aaa) a unit for storing a template for each
HTML document according to document structure data
about the structure of the HTML document used to
delimit the document into items to be extracted, the
template stipulating at least the item names to be
extracted and prescribed text extraction style data of the
item group to be extracted from the HTML document; (bbb)
a unit for analyzing a template corresponding to an
acquired HTML document; and (ccc) a unit for comparing
the acquired HTML document with the corresponding
template by scanning the acquired HTML document, and
extracting item data of the items matching the text
extraction style data of the template, so as to prepare a
search result.
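
One way to picture template-driven extraction is sketched below,
assuming the "text extraction style data" can be expressed as a
regular expression with one named group per item. That
representation, the `TEMPLATE` dictionary, and the function
`extract_items` are assumptions made for illustration and are not
taken from the specification.

```python
import re

# Hypothetical template for one shop page: the item names to extract
# and a pattern describing how one item group appears in the page.
TEMPLATE = {
    "items": ["name", "price"],
    "pattern": r"<li>\s*(?P<name>[^,<]+),\s*(?P<price>[^<]+?)\s*</li>",
}


def extract_items(html, template):
    """Scan the document and return one dict per matching item group."""
    regex = re.compile(template["pattern"], re.IGNORECASE)
    return [
        {item: match.group(item).strip() for item in template["items"]}
        for match in regex.finditer(html)
    ]


page = "<UL><LI>PC-100, 98000 yen</LI><LI>PC-200, 120000 yen</LI></UL>"
for record in extract_items(page, TEMPLATE):
    print(record)
```

Because the page layout lives in the template rather than in the
program, a change in a shop's markup would only require editing that
shop's template.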
[0031] Still another aspect of the present invention
provides a method of retrieving data contained in a
plurality of semi-structured documents over open
networks, comprising the steps of: retrieving data
scattered among semi-structured documents for an entered
query according to meta data about each of the
semi-structured documents and preparing a collective
search result, the meta data including items to be
extracted from the semi-structured documents and item data
used to conditionally retrieve the items; and outputting
the search result in a prescribed single format that is
specific to each user.

[0032] Still another aspect of the present invention
provides a method of retrieving data contained in a
plurality of semi-structured documents over open
networks, comprising the steps of: (a) finding, according
to location data that specifies the location of each of the
semi-structured documents, the location of a
semi-structured document that contains all search items
specified in an entered query that consists of the search
items and search conditions; (b) converting, if necessary,
item presentation styles of the entered query into item
presentation styles of the search items in the located
semi-structured documents according to style conversion
data and forming queries for the located semi-structured
documents, the style conversion data being used to
convert item presentation styles of a user and item
presentation styles of the semi-structured documents
from one into another; (c) transmitting the queries
provided by the step (b) to the found locations and
acquiring the semi-structured documents; (d) extracting
item data from the acquired semi-structured documents
according to document structure data, selecting the
extracted item data, if necessary, according to attribute
data for the search condition, and preparing a search
result, the document structure data specifying the
structure of each of the semi-structured documents and
being used to delimit each document into items to be
extracted, the attribute data specifying the attributes of
each item to be extracted and being used to conditionally
retrieve the items; and (e) converting, if necessary,
item presentation styles of the search result into the
item presentation styles of each user according to the
style conversion data.

[0033] Still another aspect of the present invention
provides a method of retrieving data through search
engines over open networks, comprising the steps of:
(aa) finding, according to location data that specifies the
location of each search engine, the location of a search
engine that contains all search items specified in an
entered query that consists of the search items and
search conditions; (bb) selecting, according to essential
input item data that specifies essential input items
required by an input form of each search engine, the
search engines to be searched from among the located
search engines, namely those whose essential input items
satisfy the specified search conditions; (cc) determining
an optimum retrieval pattern for each of the selected
search engines according to a matrix table and
converting the entered query into queries for the
selected search engines accordingly, the matrix table
defining combinations between the search items and
search conditions and the items and essential input
items of each search engine; (dd) converting, if
necessary, item presentation styles of the queries
provided by the step (cc) into item presentation styles of
the search items in the selected search engines according
to style conversion data that is used to convert item
presentation styles of a user and item presentation
styles of each HTML document from one into another; (ee)
transmitting the queries obtained by the step (dd) to the
found locations and acquiring HTML documents; (ff)
extracting item data from the acquired HTML documents
serving as a first search result according to document
structure data, selecting, if necessary, the extracted
item data according to attribute data for the search
condition on the basis of the corresponding retrieval
pattern, and preparing a second search result, the
document structure data specifying the structure of each
HTML document and being used to delimit each document
into items to be extracted, the attribute data specifying
the attributes of the items to be extracted and being used
to conditionally retrieve the items; and (gg) converting,
if necessary, item presentation styles of the second
search result into item presentation styles of each user
according to the style conversion data.

[0034] Still another aspect of the present invention
provides a method of extracting data item by item from
an arbitrary HTML document over open networks,
comprising the steps of: (aaa) analyzing a template
corresponding to an acquired HTML document, the template
for each HTML document being set according to document
structure data that specifies the structure of each HTML
document and is used to delimit the document into items
to be extracted, the template stipulating at least the
item names to be extracted and prescribed text extraction
style data of the item group to be extracted from the
corresponding HTML document; and (bbb) comparing the
acquired HTML document with the corresponding template
by scanning the acquired HTML document, and
extracting item data of the items matching the text
extraction style data of the template, so as to prepare a
search result.

[0035] Still another aspect of the present invention
provides a computer readable recording medium
recording a program for causing a computer to execute
processing for retrieving data contained in a plurality
of semi-structured documents over open networks,
the processing including: a process for retrieving the
data scattered among semi-structured documents for an
entered query according to meta data about each of the
semi-structured documents and preparing a collective
search result, the meta data including items to be
extracted from the semi-structured documents and item
data used to conditionally retrieve the items; and a
process for outputting the search result in a prescribed
single format that is specific to each user.
[0036] Still another aspect of the present invention
provides a computer readable recording medium
recording a program for causing a computer to execute
processing for retrieving data contained in a plurality
of semi-structured documents over open networks, the
processing including: (a) a process for finding, according
to location data that specifies the location of each of
the semi-structured documents, the location of a
semi-structured document that contains all search items
specified in an entered query that consists of the search
items and search conditions; (b) a process for converting,
if necessary, item presentation styles of the entered
query into item presentation styles of the search items in
the located semi-structured documents according to
style conversion data and forming queries for the located
semi-structured documents, the style conversion data
being used to convert item presentation styles
of a user and item presentation styles of the
semi-structured documents from one into another; (c) a
process for transmitting the queries provided by the
process (b) to the found locations and acquiring the
semi-structured documents; (d) a process for extracting
item data from the acquired semi-structured documents
according to document structure data, selecting the
extracted item data, if necessary, according to attribute
data for the search condition, and preparing a search
result, the document structure data specifying the
structure of each of the semi-structured documents and
being used to delimit each document into items to be
extracted, the attribute data specifying the attributes of
each item to be extracted and being used to conditionally
retrieve the items; and (e) a process for converting, if
necessary, item presentation styles of the search result
into the item presentation styles of each user according
to the style conversion data.

[0037] Still another aspect of the present invention
provides a computer readable recording medium
recording a program for causing a computer to execute
processing for retrieving data through search
engines over open networks, the processing including:
(aa) a process for finding, according to location data
that specifies the location of each search engine, the
location of a search engine that contains all search
items specified in an entered query that consists of the
search items and search conditions; (bb) a process for
selecting, according to essential input item data that
specifies essential input items required by an input form
of each search engine, the search engines to be searched
from among the located search engines, namely those
whose essential input items satisfy the specified search
conditions; (cc) a process for determining an optimum
retrieval pattern for each of the selected search engines
according to a matrix table and converting the entered
query into queries for the selected search engines
accordingly, the matrix table defining combinations
between the search items and search conditions and the
items and essential input items of each search engine;
(dd) a process for converting, if necessary, item
presentation styles of the queries provided by the process
(cc) into item presentation styles of the search items in
the selected search engines according to style conversion
data that is used to convert item presentation styles of
a user and item presentation styles of each HTML document
from one into another; (ee) a process for transmitting the
queries obtained by the process (dd) to the found
locations and acquiring HTML documents; (ff) a process
for extracting item data from the acquired HTML documents
serving as a first search result according to document
structure data, selecting, if necessary, the extracted
item data according to attribute data for the search
condition on the basis of the corresponding retrieval
pattern, and preparing a second search result, the
document structure data specifying the structure of each
HTML document and being used to delimit each document
into items to be extracted, the attribute data specifying
the attributes of the items to be extracted and being used
to conditionally retrieve the items; and (gg) a process
for converting, if necessary, item presentation styles of
the second search result into item presentation styles of
each user according to the style conversion data.
[0038] Still another aspect of the present invention
provides a computer readable recording medium
recording a program for causing a computer to execute
processing for extracting data item by item from
arbitrary HTML documents over open networks, the
processing including: (aaa) a process for analyzing a
template corresponding to an acquired HTML document,
the template for each HTML document being set
according to document structure data that specifies the
structure of each HTML document and is used to delimit
the document into items to be extracted, the template
stipulating at least the names of the items to be extracted
and prescribed text extraction style data of the item group
to be extracted from the corresponding HTML document; and
(bbb) a process for comparing the acquired HTML document
with the corresponding template by scanning
the acquired HTML document, and extracting item data
of the items matching the text extraction style data of the
template, so as to prepare a search result.
[0039] Other and further objects and features of the 
present invention will become apparent from the follow- 
ing description taken in conjunction with the accompa- 
nying drawings. 

BRIEF DESCRIPTION OF THE DRAWINGS 
[0040] 

Fig. 1 shows a sequence of processes for searching HTML
documents for required information according to a prior art;
Fig. 2 shows the principle of a conventional search technique;
Fig. 3 shows a sequence of processes for searching HTML
documents for required information according to an integrated
retrieval technique of the present invention;
Fig. 4 shows the principle of the integrated retrieval of the
present invention;
Fig. 5 shows an HTML document integrated retrieval apparatus
according to a first embodiment of the present invention;
Fig. 6 shows the structure of an HTML document meta data
storing unit arranged in the apparatus of Fig. 5;
Fig. 7 is a flow chart showing a preparatory phase of the first
embodiment;
Fig. 8 is a flow chart showing an execution phase of the first
embodiment;
Figs. 9A and 9B show the exemplary display and HTML
description of an HTML document;
Figs. 10A and 10B show the display and HTML description of
another HTML document;
Fig. 11 shows an example of an HTML document table stored
in the storing unit of Fig. 6;
Fig. 12 shows an example of an HTML document to table
mapping table stored in the storing unit of Fig. 6;
Fig. 13 shows an example of an HTML document item table
stored in the storing unit of Fig. 6;
Fig. 14 shows an example of a domain table stored in the
storing unit of Fig. 6;
Fig. 15 shows an example of a user domain table stored in the
storing unit of Fig. 6;
Fig. 16 shows an example of a domain conversion function
table stored in the storing unit of Fig. 6;
Fig. 17 shows an Internet information integrated retrieval
apparatus according to a second embodiment of the present
invention;
Fig. 18 shows an HTML document meta data storing unit
according to the second embodiment arranged in the apparatus
of Fig. 17;
Figs. 19A, 19B, and 19C show examples of input forms of
search engines according to the second embodiment;
Fig. 20 shows an HTML description corresponding to the input
form of Fig. 19B;
Fig. 21 is a flow chart showing a preparatory phase of the
second embodiment;
Fig. 22 shows an example of an HTML document item table
stored in the storing unit of Fig. 18;
Fig. 23 shows an example of an HTML document table stored
in the storing unit of Fig. 18;
Fig. 24 shows an example of an HTML document to table
mapping table stored in the storing unit of Fig. 18;
Fig. 25 shows an example of a domain table stored in the
storing unit of Fig. 18;
Fig. 26 shows an example of a domain conversion function
table stored in the storing unit of Fig. 18;
Fig. 27 shows an example of a user domain table stored in the
storing unit of Fig. 18;
Fig. 28 shows an example of an essential item table stored in
the storing unit of Fig. 18;
Fig. 29 shows simplified relationships between the apparatus
of the second embodiment and search engines in processing of
a search request;
Fig. 30 shows a search pattern matrix table according to the
second embodiment;
Fig. 31 is a flow chart showing an execution phase of the
second embodiment;
Fig. 32 shows a location for data items in step S410 of Fig. 31;
Figs. 33 to 35 show retrieval patterns for pages A to C
prepared in step S440 of Fig. 31;
Fig. 36 shows relationships between user input domains and
local domains prepared in step S450 of Fig. 31;
Figs. 37A and 37B show the exemplary display and HTML
description of a search result from page B;
Fig. 38 shows relationships between local domains and user
output domains prepared in step S500 of Fig. 31;
Fig. 39 shows an HTML document information extraction
apparatus according to a third embodiment of the present
invention;
Fig. 40 is a flow chart showing a preparatory phase of the
third embodiment;
Fig. 41 shows an example of a proxy setting file;
Fig. 42 shows an example of a template file;
Fig. 43 shows an example of a URL-template table;
Fig. 44 is a flow chart showing an execution phase of the
third embodiment;
Fig. 45 shows a display of an HTML document on a Web
browser;
Fig. 46 shows a part of the HTML description corresponding to
the display of Fig. 45;
Fig. 47 shows a template file for extracting item data from the
HTML document of Fig. 45 and Fig. 46;
Fig. 48 shows an example of extraction made from the HTML
document of Fig. 45 according to the template file of Fig. 47;
Fig. 49 shows a display of an HTML document on a Web
browser according to a modification of the third embodiment;
Fig. 50 shows a display on a Web browser of an HTML
document linked to the HTML document of Fig. 49 and having
the same structure as the HTML document of Fig. 49;
Fig. 51 shows an HTML description corresponding to the
display of Fig. 49; and
Fig. 52 shows an HTML description corresponding to the
display of Fig. 50.

DETAILED DESCRIPTION OF THE EMBODIMENTS 

[0041] Various embodiments of the present invention
will be described in detail with reference to the accompanying
drawings. In this specification, the semi-structured
documents include documents or other materials
described in HTML (hypertext markup language),
SGML (standard generalized markup language), XML
(extensible markup language), etc. The explanation of
the embodiments is based on HTML documents unless
specifically mentioned otherwise. Note that the following
embodiments can be applied to SGML documents and
XML documents with appropriate modification. An input
form provided by a search engine for information retrieval
consists of an HTML document. Therefore, in the following
explanation, the HTML documents include the input forms
furnished for search engines. The present invention
is widely applicable to applications that utilize plural
HTML documents that differ from one another in various
aspects and are connected together through open networks.
For example, the present invention is applicable to electronic
commerce or information retrieval on electronic libraries
and electronic catalogues.

[0042] The principle of the semi-structured document
integrated retrieval scheme of the present invention will
be explained with reference to Figs. 3 and 4.

[0043] Fig. 3 shows an image of the operation sequence
for a user according to the present invention. In Fig. 3, a
user enters a search request for, for example, a PC of
100,000 yen or below into an apparatus that realizes the
integrated retrieval scheme of the present invention.
The apparatus flexibly retrieves required information
contained in HTML documents and provides the user with
a collective search result. The search request may be
made not only with conventional keywords but also with a
simple syntactical query statement consisting of a search item
and a search condition. Namely, the present invention is
capable of handling a conditional search such as a search
for a PC of "100,000 yen or below."
[0044] Unlike structural data structured item by item
such as RDB data, HTML documents are so-called
semi-structured data in which data is structured to a certain
degree by using tags, even though HTML documents
are basically plain text. For example, a data group
related to one subject, such as a table, list, or clause
in an HTML document, may be spread over several
HTML documents, or several data groups may be
contained in a single HTML document. It is hard to
conditionally retrieve item data corresponding to a given
item from these data groups. Search engines have
HTML-described input forms that may have fixed search
entries or search entries that must be filled in to indicate
a search condition. The apparatus of the present
invention is capable of flexibly coping with a user's
search request and providing the user with a collective
search result.

[0045] Fig. 4 shows the principle of the apparatus of
the present invention. The apparatus 1 has an HTML
document storing unit 15 for storing meta data about
HTML documents. The meta data includes the location,
document structure, presentation styles, etc., of
each HTML document. The
locations of the HTML documents are, for example,
URLs. The document structure data of the HTML documents
specifies the structures of partial structures such
as tables, lists, and clauses contained in the HTML documents
and is used to map element data in the tables
and lists to items to be extracted. More particularly, the
document structure of a given HTML document indicates
that data pieces corresponding to the items to be
extracted contained in the HTML document are separated
from one another with delimiters such as tags and
slashes. Each field between delimiters such as tags and
slashes in the HTML documents is related to an item and
is managed in table format, etc., by the storing unit 15.
Data pieces contained in the HTML documents frequently
employ different presentation styles even if they
have the same meaning. The presentation styles stored
in the storing unit 15 indicate each presentation
style employed by the HTML documents.
[0046] A user of the apparatus 1 enters a search
request into a query processing unit 13. The query
processing unit 13 refers to the meta data stored in the
HTML document storing unit 15 and specifies the locations,
document structures, and presentation styles of
HTML documents related to the search request. The
query processing unit 13 acquires the HTML documents,
extracts information from the HTML documents
with the use of the specified meta data, and conditionally
processes the extracted information if necessary.
Therefore, the apparatus 1 provides the user with a collective
search result drawn from the HTML documents in
presentation styles that are optimum for the user.
Namely, with a single search request, the user is able to
collectively receive required information from the HTML
documents scattered over networks through the apparatus
1 of the present invention. This improves search
efficiency and reduces traffic congestion in the networks.

[0047] In this way, first, the apparatus of the present
invention manages the structure information of
semi-structured documents such as HTML documents
connected to open networks and retrieves requested
information item by item from plural HTML documents.
Second, the apparatus of the present invention is capable
of retrieving necessary information from Web information
documents through search engines without
bothering the user with differences among the search
methods of various Web sources.

First embodiment 

[0048] An HTML document information integrated 
retrieval apparatus of the first embodiment according to 
the present invention concerning semi-structured docu- 
ment information retrieval scheme will be explained with 
reference to Figs. 5 to 16. 

[0049] HTML documents are scattered over open networks
and have individual document structures, presentation
styles, and partial structures such as tables
containing different elements. The first embodiment
retrieves required information contained in various HTML
documents and provides a user with a collective search
result in presentation styles that are optimum for the
user.

[0050] The concept of presentation styles
and the terms used for the embodiments will be explained
first. HTML documents employ different presentation
styles to express even the same meaning. For example,
the price of a product is expressed like "¥1,000," "one
thousand yen," or "1,000 yen" depending on the writers
of HTML documents. Terms employed by this specification
will now be explained.

[0051] A domain is equal to one presentation style.
For example, "1,000 yen" for a price is a with-yen presentation
style that forms a domain, and "¥1,000" is a
with-¥ presentation style that forms a domain.
[0052] A domain group is a collection of domains
related to the same meaning. For example, prices form
a domain group, and dates (year, month, day) form a
domain group.

[0053] A user input domain is a domain related to a
user's search request input. For example, the with-yen
presentation style for a price forms a user input domain,
and the Christian era for a date with "/" as a delimiter
forms a user input domain.

[0054] A user output domain is a domain related to a
search result for a user. For example, the with-¥ presentation
style for a price forms a user output domain, and
an abbreviated date for a date with [illegible] as a delimiter
forms a user output domain.
[0055] A user domain covers user input and output
domains.

[0056] A local domain is a domain in a given HTML
document. For example, the with-yen presentation style
for a price forms a local domain.
[0057] A domain conversion function is a function for
converting a user input domain into a local domain, or a
local domain into a user output domain.
[0058] If different user input domains, user output
domains, and local domains are involved, the difference
will be resolved by the domain conversion functions.
Fig. 5 is a block diagram showing a configuration of an
HTML document information integrated retrieval apparatus
according to the first embodiment.
[0059] In Fig. 5, the apparatus 1 of the first embodiment
has a user interface unit 11, a syntax analysis unit
12, a query processing unit 13, an HTML document
access unit 14, an HTML document meta data storing
unit 15, and an HTML document meta data managing
unit 16. The query processing unit 13 has a query item
finding unit 131, a query conversion unit 132, a conversion
function library 133, an HTML document processing
unit 134, and a retrieval result conversion unit 135.
[0060] The user interface unit 11 receives a search
request (query statement) consisting of search items
and search conditions entered by a user through an
application program 3. The syntax analysis unit 12 analyzes
the syntax of the query statement received by the
user interface unit 11. The query processing unit 13
collectively retrieves required information items contained in
HTML documents. More precisely, the query item finding
unit 131 finds the locations of items specified in the
query statement. The query conversion unit 132 converts
each user input domain in the query statement
into a corresponding local domain and forms queries to
be transmitted from the HTML document access unit
14. The HTML document access unit 14 receives HTML
documents that are returned in response to the queries.
The HTML document processing unit 134 acquires
information from the received HTML documents and
processes the information according to the query statement.
For example, the HTML document processing
unit 134 selects information pieces corresponding to the
search items, filters the selected information pieces
according to the search conditions, and provides a
search result. The retrieval result conversion unit 135
converts local domains in the retrieval result into user
output domains. The HTML document access unit 14
collects HTML documents scattered over open networks
and converts information contained in the HTML
documents into information of a unified form such as a
table. The HTML document access unit 14 is connected
to HTML document servers 2-1, 2-2, and the like. Each
of the HTML document servers has HTML documents
21 and a Web server 22 that manages the HTML documents
21. The HTML document meta data storing unit
15 stores meta data about the HTML documents. The
meta data includes the document structure, presentation
styles, items, etc., of each HTML document to be
retrieved. Item information in a partial structure such
as a table in a given HTML document frequently does not
correspond to the items stipulated in a search request in a
one-to-one manner. In this case, the meta data relates the
plural elements, each of which corresponds to the
partial structure, to the item in a search request. Note
that an element hereinafter means an information piece
contained in an HTML document. The HTML document meta data
manager 16 stores new meta data in the storing unit 15
and deletes and changes the meta data in the storing
unit 15. The HTML document meta data manager 16 is
implemented in, for example, an editor and is controlled
by a system manager.

[0061] Fig. 6 shows the table structure of the HTML
document meta data storing unit 15. The HTML document
storing unit 15 stores meta data in the form of
tables. An HTML document table 151 stores the locations
of HTML documents. An HTML document to table
mapping table 152 stores data used to convert elements
contained in the HTML documents into items
forming a table. An HTML document item table 153
stores the attributes of items contained in the HTML
documents for each item. A domain table 154 stores the
presentation styles of domains. A user domain table
155 stores the input and output domains of each user. A
domain conversion function table 156 stores domain
conversion functions.
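
As a rough illustration of the six tables of paragraph [0061], the following Python sketch models the storing unit 15 as in-memory records. The class name, field names, and key layouts are assumptions made only for this illustration; the specification names the tables and what each stores, but not how they are represented.

```python
from dataclasses import dataclass, field


@dataclass
class MetaDataStore:
    """Hypothetical in-memory stand-in for the meta data storing unit 15."""
    # 151: page name -> location (URL) of the HTML document
    document_table: dict = field(default_factory=dict)
    # 152: page name -> rules for mapping elements to table columns
    mapping_table: dict = field(default_factory=dict)
    # 153: rows of (page name, column, column title, data type)
    item_table: list = field(default_factory=list)
    # 154: domain (presentation style) name -> domain group
    domain_table: dict = field(default_factory=dict)
    # 155: (user, domain group) -> (input domain, output domain)
    user_domain_table: dict = field(default_factory=dict)
    # 156: (input domain, output domain) -> conversion function name
    conversion_table: dict = field(default_factory=dict)


store = MetaDataStore()
store.document_table["Shop-B"] = "http://www.shop-b.co.jp/shouhin.html"
store.item_table.append(("Shop-B", 4, "price", "numeric value"))
store.domain_table["with-yen presentation style"] = "price"
store.user_domain_table[("user A", "price")] = (
    "with-yen presentation style",
    "with-yen presentation style",
)
```

Keeping each table as a separate mapping mirrors the way the later steps look up one table at a time.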

[0062] Processing steps carried out by the apparatus
1 of the first embodiment will be explained. The
processing steps are carried out in two phases, i.e., a
preparatory phase of Fig. 7 and an execution phase of
Fig. 8. In the preparatory phase, a managing person
prepares meta data about HTML documents through
the HTML document meta data manager 16 before
starting the execution phase.
[0063] In the preparatory phase of Fig. 7, step S100
stores the locations of HTML documents in the HTML
document table 151. Step S110 sets, in the HTML document
to table mapping table 152, data used to convert
elements contained in the HTML documents into a table
consisting of items. Step S120 sets, in the item table
153, the attributes of items contained in the HTML documents.
Step S130 sets, in the domain table 154, local
domains of the items contained in the HTML documents.
Step S140 sets, in the user domain table 155,
the input and output domains of each user. Step S145
checks to see if there are sufficient conversion functions
for converting a given domain into another. If not, step
S150 prepares the necessary domain conversion functions
and stores them in the domain conversion function table
156.

[0064] The execution phase of Fig. 8 will be explained.
In step S200, the syntax analysis unit 12 analyzes the
syntax of a query statement entered by a user, and the
query item finding unit 131 finds the locations of search
items specified by the user in the HTML document table
151. In step S210, the query item finding unit 131 finds
HTML documents that have all of the search items in
the HTML document item table 153. In step S220, the
query conversion unit 132 gets user input domains, user
output domains, and local domains corresponding to the
found items from the tables 154 and 155. In step S225,
the query conversion unit 132 checks to see if the user
input domains and local domains of the search items
agree with each other. If they do not agree for an item,
then in step S230 the query conversion unit 132 gets a
domain conversion function for the item from the domain
conversion function table 156 and converts the user input
domain of the item into the corresponding local domain.
In step S240, the HTML document processing
unit 134 gets HTML documents through the HTML
document access unit 14, extracts items for the search
items from the HTML documents, and prepares a
search result. In step S245, the HTML document
processing unit 134 checks to see if the user output
domain and local domain of each item agree with each
other. If they do not agree for an item, then in step S250
the HTML document processing unit 134 gets a domain
conversion function for the item from the domain conversion
function table 156 and converts the local domain of the item
into the corresponding user output domain.
In step S260, the search result having
proper user output domains is supplied to the user
through the user interface unit 11.
[0065] The details of the process procedure of the first
embodiment will be explained with reference to Figs. 9
to 16.

[0066] Fig. 9A shows an exemplary display on a
Web browser of an HTML document concerning
product information of a shop A, and Fig. 10A shows
that of a shop B. Fig. 9B shows an HTML description
that provides the display of Fig. 9A, and Fig. 10B shows
an HTML description that provides the display of Fig.
10A.

[0067] The shop A employs a tag TABLE to form a
table to show their product information. The shop B
employs a tag OL to form a clause of their product information.

[0068] The shop A displays each price with the with-¥
presentation style, and the shop B shows each price
with the with-yen presentation style.
[0069] The shop A has a product name as an element,
and the shop B has a maker name and a product name
as elements.

[0070] The location of the product information of the
shop A is a URL of http://www.shop-a.co.jp/products.html,
and that of the shop B is a URL of
http://www.shop-b.co.jp/shouhin.html.
[0071] In this way, the HTML documents of Figs. 9A
and 9B have different document structures, presentation
styles, and elements.

(1) Preparatory phase 

[0072] Step S100 of Fig. 7 sets the locations of the
HTML documents in the document table 151. In this
example, the locations are page names and URLs as
shown in Fig. 11.

(a) Shop A

Page name: Shop-A
URL: http://www.shop-a.co.jp/products.html

(b) Shop B

Page name: Shop-B
URL: http://www.shop-b.co.jp/shouhin.html

[0073] Step S110 sets data for converting elements
contained in the HTML documents into a table in the
HTML document to table mapping table 152. In this
example, page names, record start points, and ways of
extracting columns 1 to 4 are set as shown in Fig. 12.
For the prices of the shop B, only numerals and the
positions including commas are picked up.

(a) Shop A

Page name: Shop-A
Record start: line starting with <TR><TD>
Column 1: "Shop A" fixed
Column 2: between 1st <TD> and 1st "/" in record start line
Column 3: between 1st "/" and 1st </TD> in record start line
Column 4: between 2nd <TD> and 2nd </TD> in record start line

(b) Shop B

Page name: Shop-B
Record start: line starting with <LI>
Column 1: "Shop B" fixed
Column 2: between 1st <LI> and 1st "/" in record start line
Column 3: between 1st "/" and 2nd "/" in record start line
Column 4: between 2nd "/" and 1st "Yen" in record start line
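
The mapping rules above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the apparatus itself: the rule layout, the helper names `nth_index`, `between`, and `extract_records`, and the sample HTML lines are all hypothetical, and the delimiter between maker name and product name is assumed to be a slash because that character is not legible in the source.

```python
def nth_index(text, token, n):
    """Return the index of the n-th (1-based) occurrence of token, or -1."""
    pos = -1
    for _ in range(n):
        pos = text.find(token, pos + 1)
        if pos == -1:
            return -1
    return pos


def between(text, start_tok, start_n, end_tok, end_n):
    """Return the text between the start_n-th start_tok and the end_n-th end_tok."""
    start = nth_index(text, start_tok, start_n)
    end = nth_index(text, end_tok, end_n)
    if start == -1 or end == -1:
        return None
    start += len(start_tok)
    if end < start:
        return None
    return text[start:end].strip()


# Hypothetical encoding of the mapping table 152 for the two shops.
RULES = {
    "Shop-A": {
        "record_start": "<TR><TD>",
        "fixed": "Shop A",
        "columns": [("<TD>", 1, "/", 1), ("/", 1, "</TD>", 1),
                    ("<TD>", 2, "</TD>", 2)],
    },
    "Shop-B": {
        "record_start": "<LI>",
        "fixed": "Shop B",
        "columns": [("<LI>", 1, "/", 1), ("/", 1, "/", 2),
                    ("/", 2, "Yen", 1)],
    },
}


def extract_records(page, html):
    """Turn each record start line of the page into a row of four columns."""
    rule = RULES[page]
    records = []
    for line in html.splitlines():
        line = line.strip()
        if not line.startswith(rule["record_start"]):
            continue  # not a record start line
        row = [rule["fixed"]] + [between(line, *c) for c in rule["columns"]]
        records.append(row)
    return records


# Invented sample documents, standing in for Figs. 9B and 10B.
html_a = "<TABLE>\n<TR><TD>MakerX/PC-100</TD><TD>¥198,000</TD></TR>\n</TABLE>"
html_b = "<OL>\n<LI>MakerY/NB-200/185,000 Yen\n</OL>"
rows_a = extract_records("Shop-A", html_a)
rows_b = extract_records("Shop-B", html_b)
```

Because the rules are plain data, supporting another page would only require another entry in the rule table, which matches the role of the preparatory phase.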

[0074] Step S120 stores the attributes of the items
contained in the HTML documents in the HTML document
item table 153. In this example, the page names,
corresponding columns, column titles, and data types
are stored as shown in Fig. 13. Only price information is
defined as a numeric value in data type. Values of this
data type are used for comparison when processing the
search conditions.

(a-1) Page Shop-A, column 1

Page name: Shop-A
Column: column 1
Column title: shop name
Data type: character string

(a-2) Page Shop-A, column 2

Page name: Shop-A
Column: column 2
Column title: maker name
Data type: character string

(a-3) Page Shop-A, column 3

Page name: Shop-A
Column: column 3
Column title: product name
Data type: character string

(a-4) Page Shop-A, column 4

Page name: Shop-A
Column: column 4
Column title: price
Data type: numeric value

(b-1) Page Shop-B, column 1

Page name: Shop-B
Column: column 1
Column title: shop name
Data type: character string

(b-2) Page Shop-B, column 2

Page name: Shop-B
Column: column 2
Column title: maker name
Data type: character string

(b-3) Page Shop-B, column 3

Page name: Shop-B
Column: column 3
Column title: product name
Data type: character string

(b-4) Page Shop-B, column 4

Page name: Shop-B
Column: column 4
Column title: price
Data type: numeric value

[0075] Step S130 sets local domain names for the elements
contained in the HTML documents in the domain
table 154 as shown in Fig. 14. No local domains are set
for the shop names, maker names, and product names
of the shops A and B because they are represented with
optional character strings. On the other hand, local
domains for the product prices of the shops A and B are
set as follows according to the value set in the HTML
document item table 153. The local domain is registered
in the HTML document item table 153.

Domain group: price
Local domain of Shop-A: with-¥ presentation style
Local domain of Shop-B: value-comma presentation style

[0076] Step S140 sets user input and output domains
for each user in the user domain table 155 as shown in
Fig. 15. A user A enters a shop name, maker name, and
product name in HTML presentation styles and
requests a search output in the same presentation
styles, and therefore, no user input and output domains
for these items are set. For a price domain group,
assume that the user A requests as follows:

Input: with-yen presentation style
Output: with-yen presentation style

[0077] This domain is registered in the domain table
154, and the user domain is registered in the user
domain table 155. The user domain may contain different
user input and output domains.
[0078] Step S150 sets domain conversion functions in
the domain conversion function table 156 as shown in
Fig. 16. In this example, there are three domains including
the value-comma presentation style, with-yen presentation
style, and with-¥ presentation style.
Accordingly, mutual conversion functions between the
user input domains and the local domains and between
the user output domains and the local domains are set
as follows and are stored in the domain conversion
function table 156. These conversion functions are also
stored in the conversion function library 133.

(a) Conversion from value-comma presentation
style into with-yen presentation style

Conversion function name: Num2Yen()
Conversion input domain: value-comma presentation style
Conversion output domain: with-yen presentation style

(b) Conversion from with-yen presentation style into
value-comma presentation style

Conversion function name: Yen2Num()
Conversion input domain: with-yen presentation style
Conversion output domain: value-comma presentation style

(c) Conversion from value-comma presentation
style into with-¥ presentation style

Conversion function name: Num2¥()
Conversion input domain: value-comma presentation style
Conversion output domain: with-¥ presentation style

(d) Conversion from with-¥ presentation style into
value-comma presentation style

Conversion function name: ¥2Num()
Conversion input domain: with-¥ presentation style
Conversion output domain: value-comma presentation style

(e) Conversion from with-yen presentation style into
with-¥ presentation style

Conversion function name: Yen2¥()
Conversion input domain: with-yen presentation style
Conversion output domain: with-¥ presentation style

(f) Conversion from with-¥ presentation style into
with-yen presentation style

Conversion function name: ¥2Yen()
Conversion input domain: with-¥ presentation style
Conversion output domain: with-yen presentation style
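
The six conversions listed above might be sketched in Python as plain string functions keyed by a pair of domains. This is only an illustrative sketch: the function bodies, the ASCII function names (the ¥ sign is spelled `ysign`), the `convert` helper, and the assumed string formats ("1,000", "1,000 yen", "¥1,000") are assumptions, since the specification gives only the function names and their input and output domains.

```python
VALUE_COMMA = "value-comma presentation style"   # assumed form: "1,000"
WITH_YEN = "with-yen presentation style"         # assumed form: "1,000 yen"
WITH_YSIGN = "with-¥ presentation style"         # assumed form: "¥1,000"


def num2yen(text):
    return text + " yen"


def yen2num(text):
    if not text.endswith("yen"):
        raise ValueError("not in with-yen presentation style: " + text)
    return text[: -len("yen")].strip()


def num2ysign(text):
    return "¥" + text


def ysign2num(text):
    if not text.startswith("¥"):
        raise ValueError("not in with-¥ presentation style: " + text)
    return text[1:].strip()


def yen2ysign(text):
    return num2ysign(yen2num(text))


def ysign2yen(text):
    return num2yen(ysign2num(text))


# Hypothetical counterpart of the domain conversion function table 156:
# (conversion input domain, conversion output domain) -> function.
CONVERSIONS = {
    (VALUE_COMMA, WITH_YEN): num2yen,
    (WITH_YEN, VALUE_COMMA): yen2num,
    (VALUE_COMMA, WITH_YSIGN): num2ysign,
    (WITH_YSIGN, VALUE_COMMA): ysign2num,
    (WITH_YEN, WITH_YSIGN): yen2ysign,
    (WITH_YSIGN, WITH_YEN): ysign2yen,
}


def convert(value, source, target):
    """Convert value between domains; no conversion when they agree."""
    if source == target:
        return value
    return CONVERSIONS[(source, target)](value)
```

For example, `convert("200,000 yen", WITH_YEN, VALUE_COMMA)` would give the value-comma form used by the page of the shop B, while a query whose user input domain already agrees with the local domain is passed through unchanged.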

(2) Execution phase

[0079] The user A issues a search request consisting
of, for example, a query statement containing search
items and a search condition:

Search items: shop name, maker name, product
name, and price

Search conditions: price < 200,000 yen
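
To make the example query concrete, the following Python sketch parses a condition written like the one above and applies it to already extracted rows. The condition grammar, the regular expression, and the names `parse_condition` and `filter_rows` are assumptions for illustration; the specification does not define the syntax of the query statement, and the sample rows are invented.

```python
import operator
import re

# Comparison operators accepted by this hypothetical condition syntax.
OPERATORS = {
    "<=": operator.le,
    ">=": operator.ge,
    "<": operator.lt,
    ">": operator.gt,
    "=": operator.eq,
}

# item name, comparison operator, comma-separated number, then "yen"
CONDITION = re.compile(
    r"^\s*(?P<item>[A-Za-z ]+?)\s*(?P<op><=|>=|<|>|=)\s*"
    r"(?P<value>[\d,]+)\s*yen\s*$"
)


def parse_condition(text):
    """Split a condition such as 'price < 200,000 yen' into its parts."""
    match = CONDITION.match(text)
    if match is None:
        raise ValueError("unsupported condition: " + text)
    limit = int(match["value"].replace(",", ""))
    return match["item"], OPERATORS[match["op"]], limit


def filter_rows(rows, condition):
    """Keep the rows whose numeric item value satisfies the condition."""
    item, compare, limit = parse_condition(condition)
    return [row for row in rows if compare(row[item], limit)]


# Invented rows; prices are already numeric values.
rows = [
    {"product name": "PC-100", "price": 198000},
    {"product name": "PC-900", "price": 250000},
]
cheap = filter_rows(rows, "price < 200,000 yen")
```

The comparison is done on numeric values, in line with the item table, where only the price is given the numeric data type.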

[0080] The syntax analysis unit 12 analyzes the query
statement entered by the user. In step S200 of Fig. 8,
the query item finding unit 131 finds the search items.
The search items are the shop name, maker name,
product name, and price. The query item finding unit
131 finds the column titles corresponding to the search
items in the HTML document item table 153 and provides
the following records:

(a) Shop name

Page Shop-A, column 1, data type of character string
Page Shop-B, column 1, data type of character string

(b) Maker name

Page Shop-A, column 2, data type of character string
Page Shop-B, column 2, data type of character string

(c) Product name

Page Shop-A, column 3, data type of character string
Page Shop-B, column 3, data type of character string

(d) Price

Page Shop-A, column 4, data type of numeric value
Page Shop-B, column 4, data type of numeric value

[0081] In step S210, the query item finding unit 131
finds the names of HTML documents that contain all of
the search items and provides the following two
combinations. The URLs of the combinations are obtained
from the HTML document table 151.

(A) Combination 1

(a) Page name: Shop-A

(b) Elements

Shop name: column 1, character string
Maker name: column 2, character string
Product name: column 3, character string
Price: column 4, numeric value

(c) URL 

http://www.shop-a.co.jp/products.html

(B) Combination 2

(a) Page name: Shop-B

(b) Elements

Shop name: column 1, character string
Maker name: column 2, character string
Product name: column 3, character string
Price: column 4, numeric value

(c) URL 

http://www.shop-b.co.jp/shouhin.html
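
The selection in step S210 can be sketched as follows. This is only an illustration under stated assumptions: the list `ITEM_TABLE` is a hypothetical in-memory stand-in for the HTML document item table 153, the function name `find_pages` is invented here, and the page "Shop-C" does not appear in the description; it is added only to show a page being excluded because it lacks a search item.

```python
# Hypothetical stand-in for the HTML document item table 153:
# (page name, column number, item name, data type).
ITEM_TABLE = [
    ("Shop-A", 1, "shop name", "character string"),
    ("Shop-A", 2, "maker name", "character string"),
    ("Shop-A", 3, "product name", "character string"),
    ("Shop-A", 4, "price", "numeric value"),
    ("Shop-B", 1, "shop name", "character string"),
    ("Shop-B", 2, "maker name", "character string"),
    ("Shop-B", 3, "product name", "character string"),
    ("Shop-B", 4, "price", "numeric value"),
    # Invented page without a price column, to illustrate exclusion.
    ("Shop-C", 1, "shop name", "character string"),
    ("Shop-C", 2, "product name", "character string"),
]


def find_pages(search_items, item_table):
    """Return {page: {item: (column, data type)}} for every page
    that contains all of the requested search items."""
    by_page = {}
    for page, column, item, data_type in item_table:
        by_page.setdefault(page, {})[item] = (column, data_type)
    return {
        page: {item: items[item] for item in search_items}
        for page, items in by_page.items()
        if all(item in items for item in search_items)
    }


combinations = find_pages(
    ["shop name", "maker name", "product name", "price"], ITEM_TABLE
)
```

With the sample table above, only Shop-A and Shop-B remain as combinations, which matches the two combinations listed in paragraph [0081].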

[0082] In step S220, the query conversion unit 132
acquires user domains and local domains corresponding
to the search items. The local domains are obtained
from the HTML document item table 153. For any item
having a local domain, a domain group is found in the
domain table 154, and user domains of the same
domain group are retrieved from the user domain table
155. As a result, the following combinations are
obtained:

(A) Combination 1

(a) Page name: Shop-A

(b) Elements

Shop name: no local domain
Maker name: no local domain
Product name: no local domain
Price: local domain of with-¥ presentation style
       user input domain of with-yen presentation style
       user output domain of with-yen presentation style

(B) Combination 2

(a) Page name: Shop-B

(b) Elements

Shop name: no local domain
Maker name: no local domain
Product name: no local domain
Price: local domain of value-comma presentation style
       user input domain of with-yen presentation style
       user output domain of with-yen presentation style

[0083] For any item whose user input domain differs
from its local domain, the query conversion unit 132 gets a
domain conversion function having the corresponding
conversion input and output domains and converts the user
input domain into the local domain in step S230. In each
of the above-mentioned combinations, the user input
domain differs from the local domain in the price
presentation style. Accordingly, proper domain conversion
functions are fetched from the domain conversion
function table 156 with the conversion input and output
domain names serving as keys.

(A) Combination 1

Conversion input domain: with-yen presentation style

Conversion output domain: with-¥ presentation style

Conversion function name: Yen2¥()

(B) Combination 2

Conversion input domain: with-yen presentation style

Conversion output domain: value-comma presentation style

Conversion function name: Yen2Num()

[0084] The conversion functions are executed for the
combinations 1 and 2 to obtain the following:

(A) Combination 1

Yen2¥(200,000 yen) = ¥200,000

(B) Combination 2

Yen2Num(200,000 yen) = 200,000
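
The lookup keyed by conversion input and output domains, and the conversions themselves, might look like the sketch below. It is a minimal illustration, not the implementation of the conversion function library 133: the names `CONVERSION_TABLE`, `FORMATTERS`, `parse_amount`, and `convert` are assumptions made here, and the function names from the description appear only as strings because "¥" cannot be used in a Python identifier.

```python
def parse_amount(text):
    """Strip the yen sign, the word 'yen', and commas; return an integer."""
    cleaned = text.replace("¥", "").replace("yen", "").replace(",", "")
    return int(cleaned.strip())


# One formatter per presentation style (domain) of the price domain group.
FORMATTERS = {
    "with-yen": lambda amount: f"{amount:,} yen",
    "with-¥": lambda amount: f"¥{amount:,}",
    "value-comma": lambda amount: f"{amount:,}",
}

# Stand-in for the domain conversion function table 156, keyed by
# (conversion input domain, conversion output domain).
CONVERSION_TABLE = {
    ("with-yen", "with-¥"): "Yen2¥",
    ("with-¥", "with-yen"): "¥2Yen",
    ("with-yen", "value-comma"): "Yen2Num",
    ("value-comma", "with-yen"): "Num2Yen",
    ("with-¥", "value-comma"): "¥2Num",
}


def convert(value, input_domain, output_domain):
    """Convert a price string between presentation styles."""
    if input_domain == output_domain:
        return value  # no conversion needed
    if (input_domain, output_domain) not in CONVERSION_TABLE:
        raise KeyError(f"no conversion from {input_domain} to {output_domain}")
    return FORMATTERS[output_domain](parse_amount(value))


for_shop_a = convert("200,000 yen", "with-yen", "with-¥")
for_shop_b = convert("200,000 yen", "with-yen", "value-comma")
```

Parsing to an integer and re-formatting is a design shortcut chosen for this sketch; the description only requires that each registered function map one presentation style to another.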

[0085] The query conversion unit 132 generates the
following queries for the HTML document access unit 14:

(A) Combination 1

(a) Page name: Shop-A

(b) Search request

Search items: shop name, maker name,
product name, and price

Search conditions: price < ¥200,000

(B) Combination 2

(a) Page name: Shop-B

(b) Search request

Search items: shop name, maker name,
product name, and price

Search conditions: price < 200,000

[0086] With these queries, the HTML document
access unit 14 acquires the HTML documents, and a
search result is generated in step S240. The HTML
document processing unit 134 extracts information from the
HTML documents located at the obtained URLs and linked
URLs according to the HTML document to table mapping
table 152, filters the information if there are search
conditions, and provides the following search result:

(A) Combination 1

(a) Page: Shop-A

(b) Search result

Shop name: Shop A, maker name: Maker A,
product name: PC1, price: ¥170,000
Shop name: Shop A, maker name: Maker B,
product name: PC101, price: ¥198,000

(B) Combination 2

(a) Page: Shop-B

(b) Search result

Shop name: Shop B, maker name: Maker A,
product name: PC1, price: 168,000

[0087] If there is any item whose user output domain
differs from its local domain, the retrieval result conversion
unit 135 acquires a corresponding domain conversion
function and converts the local domain into the proper
user output domain in step S250. In each of the
above-mentioned combinations, the local domain and user
output domain of the price differ from each other, and
therefore the retrieval result conversion unit 135
searches the domain conversion function table 156 for a
proper conversion function according to the conversion
input and output domains stored in that table.

(A) Combination 1

Conversion input domain: with-¥ presentation style

Conversion output domain: with-yen presentation style

Conversion function name: ¥2Yen()

(B) Combination 2

Conversion input domain: value-comma presentation style

Conversion output domain: with-yen presentation style

Conversion function name: Num2Yen()

[0088] The conversion functions are executed to
obtain the following:

(A) Combination 1

¥2Yen(¥170,000) = 170,000 yen
¥2Yen(¥198,000) = 198,000 yen

(B) Combination 2

Num2Yen(168,000) = 168,000 yen

[0089] Finally, the user interface unit 11 provides
the user with the following search result in step S260:

Shop name: Shop A, maker name: Maker A,
product name: PC1, price: 170,000 yen
Shop name: Shop A, maker name: Maker B,
product name: PC101, price: 198,000 yen
Shop name: Shop B, maker name: Maker A,
product name: PC1, price: 168,000 yen
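
The filtering, result conversion, and merging of steps S240 to S260 can be pictured with the short sketch below. It rests on assumptions: `EXTRACTED`, `parse_amount`, and `search` are names invented here, the row for "PC200" is not in the description and is added only so that one record fails the condition, and the comparison is done on parsed integers rather than by converting the condition into each page's local domain first.

```python
def parse_amount(text):
    """Strip the yen sign, the word 'yen', and commas; return an integer."""
    cleaned = text.replace("¥", "").replace("yen", "").replace(",", "")
    return int(cleaned.strip())


# Rows as extracted per page, prices still in each page's local domain.
EXTRACTED = {
    "Shop-A": [
        ("Shop A", "Maker A", "PC1", "¥170,000"),
        ("Shop A", "Maker B", "PC101", "¥198,000"),
        ("Shop A", "Maker B", "PC200", "¥250,000"),  # invented row
    ],
    "Shop-B": [
        ("Shop B", "Maker A", "PC1", "168,000"),
    ],
}


def search(extracted, limit_text):
    """Keep rows priced below the limit and present every price
    in the user's with-yen output style."""
    limit = parse_amount(limit_text)
    merged = []
    for page in sorted(extracted):
        for shop, maker, product, price in extracted[page]:
            amount = parse_amount(price)
            if amount < limit:
                merged.append((shop, maker, product, f"{amount:,} yen"))
    return merged


result = search(EXTRACTED, "200,000 yen")
```

For the three rows taken from the description, the merged result corresponds to the list given in paragraph [0089].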

[0090] As explained above, the first embodiment
manages meta data about information contained in HTML
documents scattered over open networks, to realize
collective search on the information contained in the
plural HTML documents and generate a search result
without regard to differences among the HTML
documents. The first embodiment manages information
document by document. If an HTML document to be
searched is added, corrected, or deleted, the first
embodiment simply adds, corrects, or deletes the meta
data for that HTML document alone. The first embodiment
easily handles an exponentially increasing number of
HTML documents as search objects.

[0091] A search result from each HTML document is
obtained as item data that is conditionally processed
item by item. Therefore, the HTML document processing
unit 134 may merge plural search results from plural
HTML documents so as to prepare a single search
result, and filter this search result as a whole if
necessary.

[0092] HTML documents scattered over open
networks have different document structures, elements,
presentation styles, etc. Even with these variations, the
first embodiment is capable of retrieving required
information from the different HTML documents, converting
the retrieved information into a unified form for each
user, and returning a collective search result to the user.
Compared with the prior art, the first embodiment
eliminates the time and labor of manual work and drastically
improves search efficiency. The first embodiment is
applicable to electronic commerce in flexibly retrieving
product information with search conditions of, for
example, the names and prices of shops that offer the lowest
prices for a given product. Consequently, the first
embodiment contributes to vitalizing fair electronic
commerce.

Second embodiment

[0093] An Internet information integrated retrieval
apparatus of the second embodiment according to the
present invention, concerning a semi-structured document
information retrieval scheme, will be explained with
reference to Figs. 17 to 38.

[0094] Open networks including the Internet involve
search engines having specific input forms. The second
embodiment retrieves necessary information with
search conditions from the open networks through plural
search engines irrespective of differences in the
document structures, essential input items, and
presentation styles of the search engines and
collectively acquires a search result from the search engines.

[0095] The second embodiment employs the same
concept and terms as the first embodiment. As
explained above, HTML documents employ various
presentation styles depending on their writers and
users. For example, some HTML documents express
Kanagawa prefecture, an area in Japan, as "Kanagawa-ken"
and others simply as "Kanagawa."
[0096] "Kanagawa-ken" is a domain of a with-ken
presentation style when expressing an area. "Chinese
food" is a domain of a with-food presentation style when
expressing a genre. The area and genre each form a
domain group. If a user enters a query statement with
"Kanagawa-ken" and "Chinese food," this query
statement involves user input domains of the with-ken
presentation style for area and the with-food presentation
style for genre. If a search output for a user has
"Kanagawa-ken" and "Chinese food," this search output includes
user output domains of the with-ken presentation style
for area and the with-food presentation style for genre. If a
search result extracted from an HTML document
includes "Kanagawa-ken," this search result involves a
local domain of the with-ken presentation style for area.
[0097] If a given domain group involves different user
input, user output, and local domains,
the second embodiment resolves the difference by
using domain conversion functions, like the first
embodiment.

[0098] Figure 17 shows the Internet information
integrated retrieval apparatus 10 according to the second
embodiment. This second embodiment is a modification
of the first embodiment that replaces the query processing
unit 13 of Fig. 15 with an integrated retrieval unit 130. The
integrated retrieval unit 130 additionally has an
essential item finding unit 136, a retrieval pattern judging unit
137, and a retrieval result processing unit 138. The
apparatus 10 has a user interface unit 11, a syntax
analysis unit 12, the integrated retrieval unit 130, an HTML
document meta data storing unit 150, an HTML
document meta data manager 160, and an HTML document
access unit 14. The integrated retrieval unit 130
according to the second embodiment has a query item finding
unit 131, a query conversion unit 132, a conversion
function library 133, the essential item finding unit 136,
the retrieval pattern judging unit 137, the retrieval result
processing unit 138, and a retrieval result conversion
unit 135.

[0099] The same parts as those of the first
embodiment shown in Fig. 5 are represented with like reference
marks unless specifically mentioned, and their
explanations are not repeated. The user interface unit 11
receives a query statement entered by a user through a
user application program 3. The query statement
consists of search items and search conditions. The syntax
analysis unit 12 analyzes the syntax of the query
statement received by the user interface unit 11. The
integrated retrieval unit 130 collectively retrieves required
information involved in HTML documents that are
managed by search engines for the search items. More
precisely, the query item finding unit 131 finds the location
of the search items in HTML documents indicated in the
query statement. The essential item finding unit 136
checks essential items in the input forms of search
engines and determines the search engines to use. The
retrieval pattern judging unit 137 determines an
optimum search pattern for the query statement and
optimizes the search statement for the search engines
accordingly. The query conversion unit 132 converts
user input domains in the query statement into local
domains and prepares queries to be transmitted by the
HTML document access unit 14 to the search engines
for retrieval. The retrieval result processing unit 138
processes information contained in the acquired HTML
documents according to the query statement (e.g.,
selecting items for search items and filtering data for
search conditions). The retrieval result processing unit
138 filters the information extracted from the HTML
documents and suppresses conditional processes carried
out by the search engines. The retrieval result
conversion unit 135 converts local domains with respect to the
presentation style of retrieved items in the output of the
retrieval result processing unit 138 into user output
domains. The HTML document access unit 14 transmits
the prepared queries to the search engines and
acquires HTML documents scattered over open
networks through the search engines. The second
embodiment converts information contained in the acquired
HTML documents into a unified form, such as a table,
appropriate for the user. The HTML document access
unit 14 is connected to search engines 20-1, 20-2, and
the like through a communication network 190. Each of
the search engines consists of an engine unit 23 and a
database 24. The HTML document meta data storing
unit 150 stores information for each search engine, such
as the locations of the search engines and the
document structures, presentation styles, and elements of
HTML documents. The HTML document meta data
manager 160 adds, deletes, and changes meta data in
the HTML document meta data storing unit 150. The HTML
document meta data manager 160 is implemented in, for
example, an editor, to control the registration and
management of the meta data in the HTML document meta
data storing unit 150.

[0100] Fig. 18 shows the details of the HTML
document meta data storing unit 150. The unit 150 stores
meta data in the form of tables like the meta data storing
unit 15 of Fig. 6. An HTML document table 151 stores
the locations of HTML documents. An HTML document
to table mapping table 152 stores data for converting
elements contained in each HTML document into a
table consisting of items. An HTML document item table
153 stores the attribute of each item contained in each
HTML document. A domain table 154 stores the
presentation styles of domains. A user domain table 155
stores the input and output domains of each user. A
domain conversion function table 156 stores domain
conversion functions. An essential item table 157 stores
essential input items of the input form of each search
engine. The retrieval pattern judging unit 137 has a
retrieval pattern matrix table of Fig. 30, which is used to
determine a retrieval pattern for a given search engine and
to optimize a user query statement for the search engine.
The retrieval pattern matrix table 139 of Fig. 30 may be
stored in the meta data storing unit 150.
[0101] The details of operation of the apparatus 10 of
the second embodiment and the details of the setting of
contents for the tables will be explained. The operation
is carried out in two phases, i.e., a preparatory phase of
Fig. 21, which prepares data such as presentation styles
before retrieval, and an execution phase of Fig. 31.
[0102] Figs. 19A, 19B, and 19C show examples of
input forms of search engines. Figure 20 shows an
HTML description corresponding to the input form of
Fig. 19B.

(1) Preparatory phase 

[0103] Fig. 21 shows steps carried out in the
preparatory phase. Step S300 sets the HTML document item
table 153 as shown in Fig. 22. The HTML document item
table 153 manages the following items for each input form
of the search engines. A column "Page name" contains
the names of input forms of the search engines. A
column titled "Column" contains column numbers related
to the HTML document to table mapping table 152. A column
"Item name" contains items contained in the input forms
of the search engines. A column "Availability" contains
data to indicate whether or not the data items are
obtainable from the retrieval result of the corresponding
search engines. A column "Conditional" contains data
to indicate whether or not the data items are
conditionally processable by the corresponding search engines.
A column "Data type" contains data to indicate whether
each data item is a numeric value or a character string
and is used when evaluating and filtering information. A
column "Name tag" contains a NAME tag if a
corresponding data item employs a selection form. A column
"Local domain" contains local domains for the
corresponding column numbers.

[0104] Step S310 sets the HTML document table 151
as shown in Fig. 23. The HTML document table 151
manages the locations of the input forms of the search
engines. A column "Page name" contains the names of
the input forms of the search engines. A column
"Search engine URL" contains URLs serving as
location information of the search engines.
[0105] Step S320 sets the HTML document to table
mapping table 152 as shown in Fig. 24. The HTML
document to table mapping table 152 maps information
contained in HTML documents returned by the search
engines to a table. A column "Page name" contains the
names of the input forms of the search engines. A
column "Record start" contains tags that indicate each
start line of contents in a corresponding HTML
document. Columns titled "Column 1" to "Column 5" each
contain tags that indicate a portion corresponding to a
data item to be retrieved in each obtained HTML
document. The column titles "Column 1" to "Column 5" of
Fig. 24 correspond to the columns 1 to 5 listed in the
column titled "Column" of the HTML document item
table 153 for Page-A shown in Fig. 22. Step S330 sets
the domain table 154 as shown in Fig. 25. The domain
table 154 manages domain groups and the domains set
as local domain information in the HTML document
item table 153.

[0106] Step S340 sets the domain conversion function
table 156 as shown in Fig. 26. The domain conversion
function table 156 manages domain conversion
functions. A column "Conversion function name" contains
the name of each function for converting a specific
domain into another domain. A column "Domain group"
contains each group of domains of the same kind. A
column "Conversion input domain" contains each input
domain for each domain conversion function. A column
"Conversion output domain" contains each output
domain for each domain conversion function. A column
"Library name" contains the name of the file of the
conversion function library 133.

[0107] Step S350 sets the user domain table 155 as
shown in Fig. 27. The user domain table 155 manages
the input and output domains indicated by each user per
domain group. A column "User name" contains the
name of each user that issues a search request. A
column "User input domain" contains user input domains
used by the users for each domain group. A column
"User output domain" contains user output domains
used by the users for each domain group.
[0108] Step S360 sets the essential item table 157 as
shown in Fig. 28. The input forms of some search engines
have essential items to be filled in. The essential item table
157 manages such essential items. A column "Page
name" contains the names of the input forms of the
search engines. A column "Essential item" contains
essential items that must be filled in.

(2) Execution phase 

[0109] Figure 31 shows steps carried out in the
execution phase of the second embodiment.
[0110] For example, a user wants to know the names
and telephone numbers of Japanese food restaurants in
Kanagawa prefecture. For this, a search request is
made with a query statement of simple syntax, such as
an SQL statement containing SELECT and WHERE clauses.
[0111] In step S400, the user interface unit 11
receives the query statement. The user who made the
query is the user 1 shown in Fig. 27, and the search items
are "Shop name" and "Phone number" with search
conditions of "area = Yokohama city" and "genre =
Japanese food." The query statement is as follows:
[0112] SELECT shop name, phone number WHERE
area = "Yokohama city" and genre = "Japanese food" (1-1)

[0113] In step S410, the query item finding unit 131
refers to the HTML document item table 153 of Fig. 22
and finds search engines that have the data items
corresponding to the search items and conditions. Figure
32 shows the search engines thus found.
[0114] In step S420, the query item finding unit 131
refers to the HTML document table 151 according to the
result of step S410 and specifies pages that have the items
"Shop name," "Phone number," "Area," and "Genre."
Then, the search engines of Page-A, Page-B, and
Page-C are selected.

[0115] In step S430, the essential item finding unit 136
refers to the essential item table 157 of Fig. 28, checks
the essential items of the search engines, and narrows
the search engines to be used. Some search engines
have essential items to be filled in. Thus, among the
search engines at the found locations provided by step
S420, the essential item finding unit 136 excludes any
search engine that has an essential item other than the
items indicated in the search conditions. The query
statement (1-1) has the conditional items of "Area" and
"Genre." In connection with them, the search engine of
Page-A has an essential input item "Genre" that agrees
with the search condition item "Genre." Accordingly, the
search engine of Page-A is adoptable. The search engine
of Page-B has an essential input item "Area" that
corresponds to the search condition item "Area," and
therefore, the search engine of Page-B is also adoptable.
The search engine of Page-C has essential input items
"Area" and "Genre," and therefore, is adoptable.
[0116] On the other hand, assume that the following
query statement is entered:

SELECT shop name, phone number WHERE area
= "Yokohama city" (1-2)

[0117] In this case, the query item finding unit 131
refers to the HTML document item table 153 and selects
Page-A, Page-B, and Page-C as search engines at found
locations, since these three engines have the items "shop
name," "phone number," and "area."
[0118] Next, the essential item finding unit 136 narrows
the search engines selected by the query item finding unit
131 as follows.
[0119] Page-A sets genre as an essential item. This means
that designating the item "genre" is essential for retrieval
from Page-A, so that retrieval from Page-A fails unless genre
is designated. Genre is not designated in the search
condition, i.e., the where clause of the query statement
(1-2); accordingly, the essential item finding unit 136
excludes Page-A from the candidates.
[0120] Page-C sets both genre and area as essential
items, so that Page-C is excluded from the candidates.
[0121] In contrast, Page-B sets area as an essential
item, and "area" is designated in the where clause, so that
Page-B is selected as a search engine to be retrieved.
[0122] Note that, when the above query
statement (1-2) is transmitted to a search engine that does
not have an essential item, the search engine may be
searched even if "area" is designated in the where clause,
as the search engine (page) does not require an essential
conditional item. Accordingly, the essential item finding unit
136 selects the search engine as a search engine to be
retrieved.
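
The narrowing rule of step S430 amounts to a subset test: an engine stays a candidate only if every one of its essential items appears among the condition items of the query. The sketch below illustrates this under assumptions: `ESSENTIAL_ITEMS` is a hypothetical stand-in for the essential item table 157, `narrow_engines` is a name invented here, and "Page-D" is an invented engine with no essential items.

```python
# Stand-in for the essential item table 157 (page name -> essential items).
ESSENTIAL_ITEMS = {
    "Page-A": {"genre"},
    "Page-B": {"area"},
    "Page-C": {"area", "genre"},
    "Page-D": set(),  # invented engine without essential items
}


def narrow_engines(candidates, condition_items, essential_items):
    """Keep engines whose essential items are all designated
    in the WHERE clause of the query statement."""
    designated = set(condition_items)
    return [
        page
        for page in candidates
        if essential_items.get(page, set()) <= designated
    ]


pages = ["Page-A", "Page-B", "Page-C", "Page-D"]
for_query_1_1 = narrow_engines(pages, ["area", "genre"], ESSENTIAL_ITEMS)
for_query_1_2 = narrow_engines(pages, ["area"], ESSENTIAL_ITEMS)
```

For the query statement (1-1) every page survives, while for (1-2) only Page-B and the engine without essential items remain, in line with paragraphs [0119] to [0122].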

[0123] Returning to the query statement (1-1), at this
time, the following SQL statements according to the
query statement (1-1) are prepared for the selected
search engines:

Page-A:

SELECT shop name, phone number WHERE
area = "Yokohama city" and genre = "Japanese
food" (2-1)

Page-B:

SELECT shop name, phone number WHERE
area = "Yokohama city" and genre = "Japanese
food" (2-2)

Page-C:

SELECT shop name, phone number WHERE
area = "Yokohama city" and genre = "Japanese
food" (2-3)

[0124] In step S440, the retrieval pattern judging unit
137 refers to the retrieval pattern matrix of Fig. 30 and
determines retrieval methods. The retrieval pattern
matrix will be explained. Figure 29 shows a simplified
relationship between the apparatus of the second
embodiment and search engines. There are three
retrieval patterns (a), (b), and (c) for processing a
search request entered by a user. The pattern (a)
returns the search request to the user without
processing it. The pattern (b) conditionally processes the
search request by the search engines. The pattern (c)
processes the search request by the search engines
and filters the process result by the apparatus 10 of the
second embodiment. The retrieval pattern matrix of Fig.
30 is used to select one of the three patterns for each
search item in a given query statement. The retrieval
pattern judging unit 137 refers to the retrieval pattern
matrix and determines retrieval strategies. In Fig. 30, a
column "Item" under a title "Search request" contains
each item to retrieve specified by, for example, a
SELECT clause in an SQL statement. A column
"Condition" under the "Search request" contains each search
condition specified by, for example, a WHERE clause in
the SQL statement. A column "Item" under a title
"Search engine" contains each item returned by a
search engine as a retrieval result. A column
"Condition" under the "Search engine" contains each condition
set in a search request and stipulated in the input form
of each search engine. The column "Item" under the
"Search engine" corresponds to the column
"Availability" in the HTML document item table 153 of Fig. 22, and
the column "Condition" under the "Search engine"
corresponds to the column "Conditional" in the HTML
document item table 153. A column "Return as it is"
contains data to indicate whether or not a search
condition value is returned as it is without processing a
search item. A column "Return from search engine"
contains data to indicate whether or not a result
provided by a search engine for a given search item is
returned as it is. A column "Process by search engine"
contains data to indicate whether or not a given search
condition is processed by a search engine. A column
"Filtering" contains data to indicate whether or not a
retrieval result returned from a search engine with
respect to a given search condition is processed by the
retrieval result processing unit 138 of the apparatus 10.
[0125] For example, the search statement (1-1)
stipulates "Shop name" with the SELECT clause but not with
the WHERE clause. The item "Shop name" is "o" in
"Item" and "x" in "Condition" in "Search request" of Fig.
30. Referring to the HTML document item table 153 of
Fig. 22, the input form of the search engine Page-A of
Fig. 19A is capable of receiving "Shop name" as a
search condition and returning it as a search result.
Accordingly, the search engine of Fig. 19A is "o" in each
of "Item" and "Condition" in Fig. 30. Namely, "Shop
name" of the search engine of Fig. 19A corresponds to
the fourth record from the top of Fig. 30. Accordingly,
the process pattern of the Page-A for "Shop name"
returns information provided by the search engine as an
item without conditionally processing the information
because a condition is not stipulated in the SQL statement.
[0126] On the other hand, "Area" is specified in the
WHERE clause but not in the SELECT clause in the
search statement (1-1). Accordingly, "Area" is "x" in
"Item" and "o" in "Condition" in "Search request" of Fig.
30. According to the HTML document item table 153 of
Fig. 22, the Page-A of Fig. 19A is unable to receive a
condition for "Area" but is able to return a search result
for "Area." Accordingly, "Area" of the Page-A is "o" in
"Item" and "x" in "Condition" in "Search engine" of Fig.
30. As a result, "Area" of the Page-A corresponds to the
eighth record from the top of Fig. 30. Namely, the
process pattern of the Page-A for "Area" returns no
information because it is not stipulated in the SELECT clause of
the SQL statement, and the search engine is unable to
carry out the conditional process. Instead, the retrieval
result processing unit 138 carries out a filtering process
to return a retrieval result. Similar processes are carried
out for the Page-A on "Phone number" and "Genre"
specified in the SQL statement (1-1), to derive a matrix
of Fig. 33 from the matrix of Fig. 30.
[0127] Namely, Fig. 33 shows a result of
determination of items and conditions to be set for the Page-A with
respect to the search request. It is understood from a
column "Process by search engine" that the search
condition for "Genre" must be transmitted to the Page-A. It
is understood from a column "Filtering" that a search
result for "Area" from the Page-A must be filtered
according to the condition set for "Area." It is
understood from a column "Return from search engine" that
"Shop name" and "Phone number" provided by the
Page-A must be returned as they are to the user.
[0128] The Page-A accepts search conditions for
"Shop name" and "Genre," while the query statement
(1-1) stipulates a search condition only for "Genre."
Accordingly, "Japanese food" is set for "Genre" when
sending a query to the Page-A. Thereafter, the retrieval
result processing unit 138 carries out a filtering process
to select data in the items "Shop name" and "Phone
number" whose "Area" contains "Yokohama city" and
prepares a retrieval result. Consequently, the pattern (c)
is applied to the Page-A, and the query statement (2-1)
is rewritten as follows:

Filtering condition: "Area" = "Yokohama city"
SELECT shop name, phone number WHERE
genre = "Japanese food" (3-1)

[0129] Similarly, query statements for the Page-B and
Page-C are prepared. Figure 34 shows a result of
examination on the Page-B. It is understood from a
column "Process by search engine" that the search
condition for "Area" is transmitted to the Page-B. It is
understood from a column "Filtering" that a search
result provided by the Page-B is filtered according to the
condition set for "Genre." It is understood from a column
"Return from search engine" that information pieces to
be provided by the Page-B for "Shop name" and "Phone
number" are returned as they are to the user.
Consequently, the pattern (c) is applied to the Page-B, and the
query statement (2-2) is rewritten as follows:

Filtering condition: "Genre" = "Japanese food"
SELECT shop name, phone number WHERE area
= "Yokohama city" (3-2)

[0130] Figure 35 shows a result of examination on the
Page-C. It is understood from a column "Process by
search engine" that the search conditions for "Area" and
"Genre" are transmitted to the Page-C. It is understood
from a column "Filtering" that a search result provided
by the Page-C is not filtered. It is understood from a
column "Return from search engine" that information
pieces to be provided by the Page-C for "Shop name"
and "Phone number" are returned as they are to the
user. Consequently, the pattern (b) is applied to the
Page-C, and the query statement (2-3) is rewritten as
follows:

Filtering condition: none

SELECT shop name, phone number WHERE area
= "Yokohama city" and genre = "Japanese food"
(3-3)
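
The way each condition is routed either to the search engine or to the filtering step can be sketched as below. This is an illustration only and covers just the patterns (b) and (c). The dictionary `ENGINE_ITEMS` is a hypothetical reduction of the "Availability" and "Conditional" columns of the HTML document item table 153; entries marked "assumed" are not stated in the description, and the function name `plan` is invented here.

```python
# item -> (available as a result item, processable as a condition)
ENGINE_ITEMS = {
    "Page-A": {
        "shop name": (True, True),
        "phone number": (True, False),  # conditional flag assumed
        "area": (True, False),
        "genre": (True, True),
    },
    "Page-B": {
        "shop name": (True, False),     # conditional flag assumed
        "phone number": (True, False),  # conditional flag assumed
        "area": (True, True),
        "genre": (True, False),
    },
    "Page-C": {
        "shop name": (True, False),     # conditional flag assumed
        "phone number": (True, False),  # conditional flag assumed
        "area": (True, True),
        "genre": (True, True),
    },
}


def plan(select_items, conditions, items):
    """Split the WHERE conditions into those sent to the search engine
    and those left for filtering by the apparatus."""
    for item in select_items:
        if not items.get(item, (False, False))[0]:
            raise ValueError(f"{item!r} is not returned by this engine")
    engine_conditions, filter_conditions = {}, {}
    for item, value in conditions.items():
        available, conditional = items.get(item, (False, False))
        if conditional:
            engine_conditions[item] = value
        elif available:
            filter_conditions[item] = value
        else:
            raise ValueError(f"{item!r} cannot be evaluated for this engine")
    pattern = "c" if filter_conditions else "b"
    return pattern, engine_conditions, filter_conditions


query = {"area": "Yokohama city", "genre": "Japanese food"}
wanted = ["shop name", "phone number"]
plans = {page: plan(wanted, query, items) for page, items in ENGINE_ITEMS.items()}
```

With these sample flags, Page-A and Page-B fall under pattern (c) and Page-C under pattern (b), mirroring the rewritten statements (3-1) to (3-3).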

[0131] In step S450 of Fig. 31, the query conversion
unit 132 converts the query statements provided by the
retrieval pattern judging unit 137 into queries having
local domains appropriate for the search engines. The
query conversion unit 132 acquires, from the tables 153
and 155, user input domains and local domains for those
items of a search engine that correspond to the items
specified in the search conditions and for which a local
domain is set, as shown in Fig. 36. For each item whose
user input domain differs from its local domain, the query
conversion unit 132 fetches a proper conversion
function from the conversion function library 133 according
to the domain conversion function table 156 and
converts the user input domain into the corresponding local
domain. For example, the item "Area" in the Page-B has
a local domain of "Page-B-City." A user input domain for
this domain group is a domain "With-city (SHITSUKI)"
from the tables 154 and 155. Accordingly, the query
conversion unit 132 refers to the domain conversion
function table 156, fetches a conversion function
"Shi2ValueB()," and converts "Yokohama city" into "07,"
which indicates the seventh entry in a selection list in the
input form of the Page-B.

[0132] The item "Genre" of the Page-C has a local domain of
"Page-C-Dishes." A user input domain for this domain group is a
domain "With-food (RYOURITSUKI)" from the tables 154 and 155. As a
result, the query conversion unit 132 refers to the domain conversion
function table 156, fetches a conversion function "Ryouri2ValueC(),"
and converts the "Japanese food" into "1" that indicates the first
entry in a selection list of the input form of the Page-C.
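The lookup in the domain conversion function table 156 may be pictured as follows. This is a sketch under stated assumptions: the dictionary layout, the key pairs and the function convert_condition are hypothetical, and the two conversion functions contain only the single mapping each that the text mentions.

```python
def shi2value_b(city):
    """Stand-in for Shi2ValueB(): city name to Page-B selection-list value."""
    table = {"Yokohama city": "07"}  # the only mapping given in the text
    return table[city]


def ryouri2value_c(genre):
    """Stand-in for Ryouri2ValueC(): genre name to Page-C selection-list value."""
    table = {"Japanese food": "1"}  # the only mapping given in the text
    return table[genre]


# Hypothetical layout of the domain conversion function table 156:
# (user input domain, local domain) -> conversion function.
DOMAIN_CONVERSION_TABLE = {
    ("With-city", "Page-B-City"): shi2value_b,
    ("With-food", "Page-C-Dishes"): ryouri2value_c,
}


def convert_condition(value, user_domain, local_domain):
    """Convert a condition value only when the two domains differ."""
    if user_domain == local_domain:
        return value
    return DOMAIN_CONVERSION_TABLE[(user_domain, local_domain)](value)


print(convert_condition("Yokohama city", "With-city", "Page-B-City"))
print(convert_condition("Japanese food", "With-food", "Page-C-Dishes"))
```

Keeping the functions in a table keyed by the domain pair means a new search engine needs only a new table entry, not a change to the conversion logic.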
[0133] At this time, the queries for the search engines and filtering
conditions for the retrieval result processing unit 138 are as
follows:

Page-A:

Filtering condition: "Area" = "Yokohama city"
SELECT shop name, phone number WHERE
genre = "Japanese food" (4-1 = 3-1)

Page-B:

Filtering condition: "Genre" = "Japanese food"
SELECT shop name, phone number WHERE
area = "07" (4-2)

In the statement (4-2), the area "Yokohama city" has been changed to
"07."

Page-C:

SELECT shop name, phone number FROM Page-C
WHERE area = "Yokohama city" and genre = "1" (4-3)

[0134] In the statement (4-3), the genre "Japanese food" has been
changed to "1."
[0135] In step S470 of Fig. 31, the HTML document access unit 14
issues the following queries specific to the search engines according
to the query statements prepared in step S460. Thereafter, the search
engines carry out retrieval processes.

Page-A:

Filtering condition: "Area" = "Yokohama city"
"GET http://www.Page-a.co.jp/search-shop.cgi?category=Japanese food
http/1.0" (5-1)

Page-B:

Filtering condition: "Genre" = "Japanese food"
"GET http://www.Page-b.co.jp/search-shop.cgi?area=07 http/1.0" (5-2)

Page-C:

"GET http://www.Page-c.co.jp/search-shop.cgi?area=Yokohama city &
category=1 http/1.0" (5-3)

[0136] In step S475, the search engines return data retrieved from
HTML documents, and the retrieval result processing unit 138 extracts
necessary information therefrom according to the HTML document to
table mapping table 152. Figure 37A shows a display on a browser of
the HTML document returned by the search engine of the Page-B, and
Fig. 37B shows an HTML description corresponding to the display of
Fig. 37A. Retrieval results provided by the search engines are as
follows:

(a) Page name: Page-A

Filtering condition: "Area" = "Yokohama city"
Retrieval result:

Shop name: A1, Area: Yokohama city
Phone number: (045) ***-****
Shop name: A2, Area: Yokosuka city
Phone number: (0468) (6-1)

(b) Page name: Page-B

Filtering condition: "Genre" = "Japanese food"
Retrieval result:

Shop name: B1, Genre: Japanese food
Phone number: 045-***-****
Shop name: B2, Genre: Chinese food
Phone number: 045-***-****
Shop name: B3, Genre: Chinese food
Phone number: 045-***-**** (6-2)

(c) Page name: Page-C

Filtering condition: none
Retrieval result:

Shop name: C1, Phone number: 045-***-****
Shop name: C2, Phone number: 045-***-**** (6-3)

[0137] In step S480, the retrieval result processing unit 138 finds
any item that needs a filtering process according to the retrieval
pattern matrix of Fig. 30. In step S490, the retrieval result
processing unit 138 carries out the filtering process on the
retrieval result of each search engine. In the example, the Page-A
pays no attention to the condition "Area" = "Yokohama city" and the
Page-B pays no attention to the condition "Genre" = "Japanese food."
Accordingly, these retrieval results are filtered to extract data
that satisfies "Area" = "Yokohama city" and "Genre" = "Japanese food"
as follows:

(a) Page name: Page-A

Filtering result

Shop name: A1, Phone number: (045) ***-**** (7-1)

(b) Page name: Page-B

Filtering result

Shop name: B1, Phone number: 045-***-**** (7-2)

(c) Page name: Page-C

Filtering result

Shop name: C1, Phone number: 045-***-****
Shop name: C2, Phone number: 045-***-**** (7-3 = 6-3)
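The filtering of step S490 amounts to keeping only the records that satisfy every filtering condition left over for a page. A minimal sketch, assuming each retrieval result is held as a list of dictionaries (the function name apply_filtering and this record layout are illustrative, not taken from the specification):

```python
def apply_filtering(records, filtering):
    """Keep records whose fields equal every filtering condition.

    An empty `filtering` dict keeps all records, as for the Page-C.
    """
    return [
        r for r in records
        if all(r.get(item) == value for item, value in filtering.items())
    ]


# Retrieval result (6-2) of the Page-B, with the masked phone numbers as given.
page_b_result = [
    {"Shop name": "B1", "Genre": "Japanese food", "Phone number": "045-***-****"},
    {"Shop name": "B2", "Genre": "Chinese food", "Phone number": "045-***-****"},
    {"Shop name": "B3", "Genre": "Chinese food", "Phone number": "045-***-****"},
]

filtered = apply_filtering(page_b_result, {"Genre": "Japanese food"})
print([r["Shop name"] for r in filtered])
```

Applied to (6-2), only the shop B1 remains, which corresponds to the filtering result (7-2).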



[0138] In step S500, the retrieval result conversion unit 135
acquires the user output domains and local domains for the specified
search items whose local domain is stipulated from the tables 153,
154 and 155, as shown in Fig. 38. For any item having different user
output domain and local domain, the retrieval result conversion unit
135 converts the local domain into a corresponding user output domain
according to a conversion function fetched from the domain conversion
function table 156. For example, the item "Phone number" of the
Page-A has a local domain and a user output domain that are identical
to each other, and therefore, no conversion is carried out. The item
"Phone number" of each of the Page-B and Page-C has a local domain
"Tel-Bar" and a user output domain "Tel-Paren." As a result, the
retrieval result conversion unit 135 fetches a conversion function
"Bar2Paren()" from the domain conversion function table 156 to
convert "045-***-****" into "(045) ***-****." The local domains of
Page-B and Page-C are converted into user output domains as follows:

Input: "045-***-****" (Domain: Tel-Bar)
Domain conversion function: Bar2Paren()
Output: "(045) ***-****" (Domain: Tel-Paren)

[0139] In step S510, the user interface unit 11 returns a collective
search result prepared from the above-mentioned retrieval results to
the user, and the application program 3 of the user displays the
result in the form of, for example, a table.

Shop name: A1, Phone number: (045) ***-****
Shop name: B1, Phone number: (045) ***-****
Shop name: C1, Phone number: (045) ***-****
Shop name: C2, Phone number: (045) ***-****

[0140] As explained above, the second embodiment prepares search
requests for a plurality of search engines scattering over open
networks by individually managing the objects of the input forms of
the search engines, thereby resolving differences among the
interfaces of the search engines and flexibly retrieving necessary
information through the search engines. Information involved in HTML
documents returned from plural search engines differs from one
another in document structure, presentation style, input form, etc.,
and therefore, search engines return results in various ways. The
second embodiment resolves these differences and provides a user with
a search result in an integrated form, whatever differences derive
from the search engines. The second embodiment improves search
efficiency and reduces traffic in the networks. The second embodiment
individually registers and manages the input forms of various search
engines and easily controls meta data about HTML documents related to
the search engines.
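The Tel-Bar to Tel-Paren conversion used in step S500 can be illustrated with a small function. The specification names the conversion function Bar2Paren() but does not give its body, so the regular expression below is an assumption about how such a function could behave.

```python
import re

# Area code, a hyphen, then the rest of the number (hypothetical pattern).
_TEL_BAR = re.compile(r"^([^-]+)-(.+)$")


def bar2paren(tel):
    """Convert a Tel-Bar style number "045-xxx-xxxx" to Tel-Paren "(045) xxx-xxxx"."""
    m = _TEL_BAR.match(tel)
    if m is None:
        raise ValueError(f"not a Tel-Bar phone number: {tel!r}")
    return f"({m.group(1)}) {m.group(2)}"


print(bar2paren("045-***-****"))
```

Only the first hyphen is treated as the boundary of the area code; the remaining hyphen is kept as it is, in line with the input and output shown for Bar2Paren().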



Third embodiment

[0141] An HTML document information extraction apparatus of the third
embodiment according to the present invention concerning the
semi-structured document information retrieval scheme will be
explained with reference to Figs. 39 to 53.

[0142] The third embodiment retrieves information item by item from
HTML documents scattering over open networks. This third embodiment
is a modification of the first embodiment to form the HTML document
processing unit 134 of the first embodiment of Fig. 5 with a template
analysis unit 1341, a URL-template table 1342, and a template
processing unit 1343. The arrangement of Fig. 39 may singularly be
achieved or may properly be combined with the arrangements of the
first and second embodiments. For example, the arrangement of Fig. 39
may have the syntax analysis unit 12, item finding unit 131, query
conversion unit 132, HTML document meta data storing unit 15, HTML
document meta data manager 16, etc., of Figs. 5 and 17.
[0143] To extract information item by item from HTML documents, the
third embodiment manages the locations and document structures of
HTML documents for each HTML document. More precisely, the third
embodiment manages the locations of HTML documents by using URLs of
the HTML documents. Its proxy information may be managed by using a
proxy setting file 141 that stores proxy server names and proxy port
numbers related to the HTML documents. The document structures of
HTML documents include information of partial structures such as
tables, lists and clauses contained in the HTML documents; that is,
items to be extracted are delimited by delimiters such as tags and
slashes, for example. The document structure information includes the
attributes of columns and data types for each item. The third
embodiment stores and manages these document structures of HTML
documents as item name, extraction text specifying part, data type of
the item name, etc., in template files 1345. The data type of a given
item may be a character or a numeric value and is used when
processing data related to the item. The URL-template table 1342
relates the template files 1345 to the URLs or file names of HTML
documents to be searched. Each HTML document is converted into a
unified form such as a table according to extraction text specifying
parts of a corresponding template file. The template files 1345
correspond to the HTML document to table mapping table 152 and HTML
document item table 153 of Figs. 6 and 18.

[0144] When a user specifies a URL or a file name, the third
embodiment refers to the proxy setting file 141, URL-template table
1342, and template files 1345. For example, if a user specifies a
URL, the third embodiment refers to the proxy setting file 141 to
acquire a corresponding HTML document name, refers to the
URL-template table 1342 to acquire a template file name, scans the
acquired HTML document one line or plural lines at a time from the
top thereof, compares the scanned contents with extraction text
specifying parts of the template file 1345, and extracts information
item by item accordingly. At this time, the third embodiment checks
to see if there is a link to the next page in the template file 1345.
If there is, the third embodiment acquires the URL or file name of
the next page and extracts data from the page. The third embodiment
repeats these operations to completely read links. The third
embodiment maps the extracted information to a table item by item by
referring to the template file 1345, shapes the information according
to data types stipulated in the template file 1345, and returns the
names of the items from which the information has been extracted and
the shaped and itemized information to the user. Unlike the prior
arts, the third embodiment optionally defines the data types of
elements (information pieces) extracted from HTML documents so that
it conditionally processes the information pieces according to search
conditions. Similar to the first and second embodiments, the third
embodiment is capable of processing the presentation styles of
information according to a user's request.
[0145] Fig. 39 is a block diagram showing the HTML document
information extraction apparatus according to the third embodiment.

[0146] In Fig. 39, the apparatus 100 of the third embodiment has a
user interface unit 11, an HTML document access unit 14, the proxy
setting file 141, an HTML document processing unit 134, the template
files 1345, and a retrieval result conversion unit 135. The HTML
document processing unit 134 has the template analysis unit 1341,
URL-template table 1342, and template processing unit 1343. A user
enters a query statement 301 through an application program 3.
According to the query statement 301, the apparatus 100 accesses HTML
documents directly or through a proxy server 2, acquires information
from the HTML documents, processes the information according to
template files 1345, and returns a search result 302 to the user.
[0147] HTML documents are scattered over networks and have different
locations, tags, and information elements. To cope with these
differences and extract information item by item from them, the
apparatus 100 individually manages the locations and document
structures of the HTML documents for each HTML document. In addition,
the apparatus 100 provides a search result in a unified form such as
a table.

[0148] The user interface unit 11 receives the query statement 301
entered by the user through the application program 3 and transmits
it to the HTML document access unit 14. According to a URL or a file
name provided by the user interface unit 11, the HTML document access
unit 14 refers to the proxy setting file 141 and acquires an HTML
document (4-1, 4-2). The HTML document is transferred to the template
analysis unit 1341. If the HTML document contains link data, the
template analysis unit 1341 extracts linked URLs, according to which
the HTML document access unit 14 refers to the proxy setting file 141
if necessary and acquires HTML documents (4-1, 4-2) having the linked
URLs. Figure 41 shows an example of the proxy setting file 141 that
specifies proxy server names and proxy port numbers, that is, the
location data of the proxy server necessary for acquiring HTML
documents, and is referred to by the HTML document access unit 14.
Figure 42 shows an example of one of the template files 1345 that
specifies parts that are extractable as items and items to be
extracted in extraction text specifying parts. The template file also
specifies data types of the items to be extracted. The template files
1345 are referred to by the template analysis unit 1341. The
URL-template table 1342 shown in Fig. 43 manages relationships
between URLs or file names and template files and is referred to by
the template analysis unit 1341. The template analysis unit 1341
fetches the name of a template file corresponding to the query
statement 301 from the URL-template table 1342. At the same time, the
template analysis unit 1341 refers to the template file 1345 for the
acquired name of the template file and analyzes and acquires
extractable parts, items to be extracted, and data types of the items
to be extracted of the HTML document in query. The acquired data is
transferred from the template analysis unit 1341 to the template
processing unit 1343. The template analysis unit 1341 also determines
whether or not there are linked URLs in the template file 1345. If
there are linked URLs, they are transferred to the HTML document
access unit 14, which acquires linked HTML documents accordingly.
According to the extractable parts, the items to be extracted, and
the data types of the items to be extracted from the template
analysis unit 1341, the template processing unit 1343 extracts item
data from the HTML documents. The retrieval result conversion unit
135 receives the extracted information and the data types thereof
from the template processing unit 1343 and carries out conversion on
the extracted information according to the data types. The converted
information is sent as a search result 302 to the user through the
user interface unit 11.

[0149] The apparatus 100 of the third embodiment, or any one of the
apparatuses of the first and second embodiments, may be realized with
a computer having a CPU, memories, I/O devices, external storage
devices, etc., and a medium for recording a program that provides the
functions of the present invention when being read by the computer.
[0150] The proxy server 2 acts as an intermediary to acquire an HTML
document specifiable by the apparatus 100 and returns an HTML
document (4-1, 4-2) specified by a URL to the apparatus 100. The HTML
documents 4-1 and 4-2 are tagged text files constituting home pages
scattering over open networks. The application program 3 receives
from a user a search request at least containing a URL or file name
and search items, gets a search result for the search request from
the apparatus 100, and provides the user with the search result.
[0151] Processing steps carried out by the apparatus 100 of the third
embodiment will be explained. The steps are carried out in a
preparatory phase of Fig. 40, which prepares data such as
presentation styles before retrieval, and an execution phase of Fig.
44. The preparatory phase of Fig. 40 is carried out by a managing
person with the use of, for example, an editor but not by operating
the whole of the apparatus 100.

(1) Preparatory phase

[0152] The preparatory phase of Fig. 40 will be explained. Step S605
sets a proxy server name and a proxy port number to form the proxy
setting file 141 of Fig. 41, if a proxy server is needed (S600Y).
Step S610 prepares a template file. The template file has a unique
name among all template files and contains the following data (Fig.
42):

(a) Items to be extracted

[0153] Information about items to be extracted corresponds to the
keyword "Word."

[0154] The template file stipulates the names of items from which
information pieces are extracted, the data types of the items, and
fixed values added to the items. In the example of Fig. 42, the data
type is "1" to indicate a character type. Note that the data type may
be set according to desired filtering processing such as "3" for a
numeric value type, or "4" for a character string adding type. The
template file of Fig. 42 includes a linked address (URL's relative
path) at the portion headed "Next URL." These pieces of data type and
fixed value are needed when adding or deleting information with
respect to a search result to be returned to a user.

(b) Text extraction specifying part 

[0155] Information about text to be extracted corresponds to the
portion headed "HTML Template."
[0156] A record that contains information to be extracted is copied
from a target HTML document (Web page). A required information part
is replaced with "$item name$" and each part in the record that can
be omitted is replaced with an omit mark.

[0157] If a given HTML document includes partial structures to be
handled, character strings specifying the ends of the tables are set.
In the example of Fig. 42, there are first, second and third tables
and related items.

[0158] If there is any linked URL, a character string for specifying
the linked URL is set. Thereafter, step S620 prepares the
URL-template table 1342 containing URLs or file names and
corresponding template file names, as shown in Fig. 43.
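One way to picture the text extraction specifying part is as a record copied from the page in which each required part has been replaced by a "$item name$" placeholder. The sketch below turns such a record into a regular expression. It is an assumption-laden illustration: the function names are hypothetical, item names are assumed to be identifier-like and unique within one record, and the omit mark is not modeled because its exact notation is not given in this passage.

```python
import re

# "$name$" placeholders; names are assumed to look like identifiers.
_PLACEHOLDER = re.compile(r"\$([A-Za-z_]\w*)\$")


def compile_template(template_line):
    """Turn a template record into a regex with one named group per item."""
    parts = []
    pos = 0
    for m in _PLACEHOLDER.finditer(template_line):
        parts.append(re.escape(template_line[pos:m.start()]))  # literal text
        parts.append(f"(?P<{m.group(1)}>.*?)")                 # item to extract
        pos = m.end()
    parts.append(re.escape(template_line[pos:]))
    return re.compile("".join(parts))


def extract_items(template_line, html_line):
    """Return {item name: text} for a matching record, or None."""
    m = compile_template(template_line).search(html_line)
    return m.groupdict() if m else None


template = "<TR><TD>$shop$</TD><TD>$tel$</TD></TR>"
record = "<TR><TD>B1</TD><TD>045-***-****</TD></TR>"
print(extract_items(template, record))
```

Because the template is just a copied record with placeholders, it stays readable next to the target page, which is the property the specification emphasizes for template files.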

(2) Execution phase 

[0159] Figure 44 shows steps in the execution phase for extracting
information from items of a given HTML document according to the
third embodiment.
[0160] In step S700, the user interface unit 11 receives a query
statement entered by a user through the application program 3. The
query statement includes a URL or a file name and search items. If
the query statement includes a URL, the HTML document access unit 14
refers to the proxy setting file 141 if the corresponding file 141 is
defined (4-1) and acquires an HTML document having the URL. If the
query statement contains a file name, a local HTML document having
the file name is specified. According to the URL or file name and the
contents of the proxy setting file 141, the HTML document access unit
14 acquires an HTML document directly or through the proxy server 2
and receives a corresponding HTML document in step S710.
[0161] In step S720, the template analysis unit 1341 checks to see if
there is a template file 1345 corresponding to the URL. Namely, the
template analysis unit 1341 searches the URL-template table 1342 for
the URL or file name stipulated in the query statement. If there is
no corresponding template file (Step S720N), the template analysis
unit 1341 sends an error message to the user interface unit 11. If
there is a corresponding template file, the template analysis unit
1341 fetches the template file from among the template files 1345,
analyzes extraction rules stipulated in the template file, and
transfers the extraction rules to the template processing unit 1343,
in step S730.
[0162] In step S740, the template processing unit 1343 extracts
information item by item from the HTML document (4-1, 4-2) according
to the extraction rules obtained from the template file 1345 and
stores the extracted information in a table. In step S750, the
template processing unit 1343 analyzes the extraction rules and
determines whether or not there is a linked URL. If there is (Step
S750Y), the template processing unit 1343 transfers the linked URL to
the HTML document access unit 14, which acquires an HTML document
having the linked URL. The acquired HTML document with the linked URL
is subjected to the steps S730 to S750.
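The control flow of steps S710 to S750 can be sketched as a loop that extracts records, looks for a linked URL, and repeats. Everything named here is hypothetical: the template is reduced to two regular expressions ("row" and "next"), documents are fetched through a caller-supplied function rather than a proxy, and the guard against revisiting a URL is an addition of this sketch, not something the specification states.

```python
import re


class TemplateNotFound(Exception):
    """No template is registered for the URL (corresponds to Step S720N)."""


def run_extraction(url, url_template_table, fetch):
    """Extract records from `url` and from every page reached via linked URLs."""
    if url not in url_template_table:            # S720: look up the template
        raise TemplateNotFound(url)
    template = url_template_table[url]           # S730: obtain extraction rules
    table, seen = [], set()
    while url is not None and url not in seen:
        seen.add(url)
        html = fetch(url)                        # S710: acquire the document
        table.extend(template["row"].findall(html))   # S740: extract items
        link = template["next"].search(html)     # S750: is there a linked URL?
        url = link.group(1) if link else None
    return table


# Tiny in-memory stand-ins for two linked HTML documents.
PAGES = {
    "page1.html": '<TD>A1</TD><TD>A2</TD><A HREF="page2.html">next</A>',
    "page2.html": "<TD>A3</TD>",
}
TEMPLATE = {
    "row": re.compile(r"<TD>(.*?)</TD>"),
    "next": re.compile(r'<A HREF="(.*?)">'),
}
URL_TEMPLATE_TABLE = {"page1.html": TEMPLATE}

print(run_extraction("page1.html", URL_TEMPLATE_TABLE, PAGES.__getitem__))
```

The same template is applied to the linked page, which mirrors the case in the specification where the linked document has the same structure as the first one.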
[0163] The retrieval result conversion unit 135 refers to the
template file 1345 to carry out the following processes on the
extracted items of information:

a) executing no processes on item data whose data types are ruled to
display information as it is;

b) returning fixed values from the retrieval result conversion unit
135 for items whose data types are ruled to have the fixed values
even if the HTML document contains no corresponding information;

c) deleting commas from numeric values for item data whose data types
are ruled to do so; and

d) adding fixed values such as relative URL paths to item data whose
data types are ruled to have such additional values.

[0164] According to these pieces of data, the retrieval result
conversion unit 135 prepares a search result and transmits it to the
application program 3 through the user interface unit 11.
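The data-type dependent processes a) to d) could be dispatched as in the sketch below. The codes "1", "3" and "4" follow the character, numeric value and character string adding types named for the template file; the code "2" for fixed values is purely hypothetical, as is the choice to put the fixed value in front of the item data in case d).

```python
def shape_item(value, data_type, fixed=""):
    """Shape one extracted item according to its data type code."""
    if data_type == "1":        # a) character type: return the data as it is
        return value
    if data_type == "2":        # b) fixed value (hypothetical code "2")
        return fixed
    if data_type == "3":        # c) numeric value type: delete commas
        return value.replace(",", "")
    if data_type == "4":        # d) string adding type: prefix is an assumption
        return fixed + value
    raise ValueError(f"unknown data type: {data_type!r}")


print(shape_item("1,200", "3"))
print(shape_item("html_2.html", "4", fixed="./"))
```

Keeping the type code in the template file rather than in the program means a managing person can change how an item is shaped by editing the template alone.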

[0165] Figures 45 to 48 show examples of extracting information item
by item according to the third embodiment, in which Fig. 45 is a
display of an HTML document on a Web browser, Fig. 46 is a part of
HTML description corresponding to the display of Fig. 45, and Fig. 47
shows a template file for extracting information item by item from
the HTML document of Figs. 45 and 46. The template file includes
items to be extracted, i.e., "racename," "grade," "circle," "mmdd,"
"distance," "condition," "time," "winhorse," "sex_age," "jockey,"
"leki (trainer)," and "uri." The template file also includes a text
extraction specifying part for extracting these items. Figure 48
shows an example of information extraction from the HTML document of
Figs. 45 and 46 according to the template file of Fig. 47. This
example assumes that the application program 3 specifies or selects
"jockey," "winhorse," and "racename" as search items.
[0166] Figs. 42 and 49 to 52 show a modification of the third
embodiment. The template file of Fig. 42 of the third embodiment
contains the first and second tables that are partial structures
consisting of the same elements for the same HTML document. Here, the
partial structure is a data group related to one subject such as a
table, list or clause. On the other hand, the modification extracts
required information item by item by employing a template file that
contains items having different attributes for the same HTML
document, or a template file that contains partial structures having
different elements for the same HTML document, or a template file
that is applicable for an HTML document including link information.

[0167] Figs. 49 and 50 show examples of displays on a Web browser of
HTML documents showing shop information. These HTML documents each
have three tables having the same structures. Figure 51 shows an HTML
description corresponding to the HTML document of Fig. 49, and Fig.
52 shows an HTML description corresponding to the HTML document of
Fig. 50. Fig. 42 shows a template file for extracting information
item by item from the HTML documents of Figs. 49 to 52. The template
file of Fig. 42 contains "TableEndDelimiter" to indicate the end of a
partial structure such as a table, list or a clause, the names of
items to be extracted in words, data types of the items in words, and
a text extraction specifying part "HtmlTemplate." For example,
TableEndDelimiter = </TABLE> indicates that an appearance of </TABLE>
specifies the end of a partial structure.
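The role of "TableEndDelimiter" can be illustrated by cutting a document into partial structures at each occurrence of the delimiter. The function below is a hypothetical helper, not part of the specification; it matches the delimiter exactly as written and drops any text after the last delimiter.

```python
def split_partial_structures(html, end_delimiter="</TABLE>"):
    """Cut `html` into chunks, each ending with `end_delimiter`."""
    chunks = []
    pos = 0
    while True:
        i = html.find(end_delimiter, pos)
        if i < 0:               # no further partial structure
            break
        end = i + len(end_delimiter)
        chunks.append(html[pos:end])
        pos = end
    return chunks


doc = "<TABLE>first</TABLE><P>note</P><TABLE>second</TABLE>trailer"
for chunk in split_partial_structures(doc):
    print(chunk)
```

Each chunk can then be compared with the description of the corresponding partial structure in the template file.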
[0168] <A HREF = "./html_2.html"> in Fig. 51 indicates a link to the
HTML document of Fig. 52. The template analysis unit 1341 analyzes
this link information. According to the link information and
"NextURL" in the template file of Fig. 42, the template processing
unit 1343 extracts information not only from the items of the HTML
document of Fig. 49 but also from the items of the HTML document of
Fig. 50.

[0169] First and second tables in the HTML description of Fig. 51 are
two partial structures having the same document structure and the
same data types. According to the descriptions about the first and
second structures in the template file of Fig. 42, the template
processing unit 1343 extracts item data in the partial structures
having the same structure in the same HTML document. The HTML
description of Fig. 52 has the same structure as that of Fig. 51, and
therefore, information is extracted item by item therefrom according
to the template file of Fig. 42.

[0170] The first and second tables in the HTML document of Fig. 51
are two partial structures having different attributes, in
particular, presentation attributes. Among information pieces in an
item "Genre" in the HTML document of Fig. 51, some are delimited with
<I> and </I> and some are not. The tag <I> indicates to display a
corresponding information piece in italic. A tag <B> indicates to
display a corresponding information piece in bold. In the template
file of Fig. 42, the information for these different attributes is
defined with two descriptions, which are applied to one line of a
corresponding partial structure of the HTML documents. If a given
HTML document agrees with one of the descriptions, item information
is extracted from the corresponding HTML document. In Fig. 42, an
omission tag is used for the item "Genre" to extract information
pieces from the item without regard to the presentation attribute
thereof.

[0171] In Fig. 51, a third table is a partial structure having an
element "Evaluation" that is not in the first and second tables. A
description about the third table in Fig. 53 enables the template
processing unit 1343 to extract partial structures having different
elements in the same HTML document.

[0172] As explained above, the third embodiment manages data about
information contained in plural HTML documents, extracts information
item by item from the HTML documents according to the data, and
provides a user with required information in a unified form such as a
table. The third embodiment prepares a text extraction specifying
part to specify only the items from which information must be
extracted according to a user's request, thereby making the formation
and maintenance of the retrieval system easier. The third embodiment
retrieves information item by item from HTML documents scattering
over open networks without regard to varying interfaces attached to
the HTML documents, and provides each user with required information
in a required form.

[0173] The third embodiment employs template files that are
independent of HTML syntax rules, to extract required information
item by item from HTML documents, if the HTML documents have items
delimited with, for example, tags. The third embodiment extracts
information item by item from HTML documents only by preparing
template files that define the items from which information is
extracted. The template files can easily be prepared according to
target HTML documents and are visually understandable. Consequently,
the third embodiment easily and flexibly extracts information item by
item from HTML documents.
[0174] It is to be noted that, besides those already 
mentioned above, many modifications and variations of 
the above embodiments may be made without depart- 
ing from the novel and advantageous features of the 
present invention. Accordingly, all such modifications 
and variations are intended to be included within the 
scope of the appended claims. 

Claims 

1. An apparatus for retrieving data contained in a plurality of
semi-structured documents over open networks, comprising:

a unit (15) for storing meta data for each of the semi-structured
documents, the meta data including items to be extracted from the
semi-structured documents and item data used to conditionally
retrieve the items;
a unit (13) for retrieving data scattered among the semi-structured
documents for entered query according to the meta data, and preparing
a collective search result; and
a unit (11) for outputting the search result in a prescribed single
format that is specific to each user.

2. An apparatus for retrieving data contained in a plurality
of semi-structured documents over open networks, comprising:

(a) a unit (15) for storing location data about the location
of each of the semi-structured documents, document structure
data about the structure of each of the semi-structured
documents, used to delimit document into items to be
extracted, attribute data about the attributes of each of
the items to be extracted, used to conditionally retrieve
the items, and style conversion data used to convert item
presentation styles of the user and item presentation styles
of the semi-structured documents from one into another;

(b) a unit (131) for finding, according to the location
data, the location of a semi-structured document that
contains all search items specified in an entered query that
consists of the search items and search conditions;

(c) a unit (132) for converting, if necessary, item
presentation styles of the entered query into item
presentation styles of the search item in location found
semi-structured documents according to the style conversion
data, and forming queries for the location found
semi-structured documents;

(d) a unit (14) for transmitting the queries provided by the
unit (c) to the found locations and acquiring the
semi-structured documents;

(e) a unit (134) for extracting item data from the acquired
semi-structured documents according to the document
structure data, selecting the extracted item data, if
necessary, according to the attribute data for the search
condition, and preparing a search result; and

(f) a unit (135) for converting, if necessary, item
presentation styles of the search result into the item
presentation styles of each user according to the style
conversion data.
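
As an illustration of the finding step performed by unit (b), the following minimal sketch keeps, for each document, a record of its location and of the item names it is known to contain, and returns the locations whose documents contain every search item. The `SourceMeta` record, the `find_locations` function, and the second catalog entry (`shop-b.example`) are hypothetical names chosen for this sketch; they do not appear in the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceMeta:
    """Hypothetical meta data record for one semi-structured document."""
    location: str      # location data (here, a URL)
    items: frozenset   # names of the items the document contains


def find_locations(catalog, search_items):
    """Return the locations of documents that contain all search items."""
    wanted = set(search_items)
    return [meta.location for meta in catalog if wanted <= meta.items]


CATALOG = [
    SourceMeta("http://www.shop_a.co.jp/products.html",
               frozenset({"product name", "price"})),
    # Assumed second source that lists product names only.
    SourceMeta("http://shop-b.example/list.html",
               frozenset({"product name"})),
]

found = find_locations(CATALOG, ["product name", "price"])
print(found)
```

A subset test is used because the claim requires a document to contain all of the search items, not merely some of them.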

3. The apparatus of claim 2, further comprising:

(g) a unit (1345) for storing, for each of the
semi-structured documents, a template that stipulates at
least item name to be extracted and prescribed text
extraction style data of item group to be extracted
according to the document structure data,

wherein the unit (e) compares the acquired semi-structured
document with corresponding templates by scanning the
acquired semi-structured document; and
extracts item data of the items matching the text extraction
style data of the template so as to prepare the search
result.

4. The apparatus of claim 3, wherein: 

the unit (e) shapes the search result into a table.

5. The apparatus of claim 3, wherein, if the text extraction
style data of a given template includes link data to another
semi-structured document:

the unit (e) scans a linked semi-structured document and
compares the linked semi-structured document with the
template.

6. The apparatus of claim 3, wherein:

any template that is for a semi-structured document having
a plurality of partial structures of the same structure
contains text extraction style data for each of the partial
structures; and
the unit (e) extracts the item data so as to prepare the
search result for each of the partial structures.

7. The apparatus of claim 3, wherein:

the template contains a plurality of pieces of text
extraction style data for each of partial structures, the
text extraction style data being used for filtering uneven
parts contained in the partial structure; and

the unit (e) extracts item data of the items matching the
text extraction style data, by scanning the acquired
semi-structured document, when the partial structure of the
semi-structured document matches any one piece of the text
extraction style data.
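
One way to picture several pieces of text extraction style data for the same partial structure is a list of alternative patterns that are tried in turn on each row. The sketch below is only an assumption about how such matching could look: the two regular expressions, the `<B>` variant, and the function name `extract_rows` are invented for illustration and are not taken from the specification.

```python
import re

# Two alternative patterns for one partial structure (a table row).
# The second tolerates an "uneven" row whose price is wrapped in <B>.
PATTERNS = [
    re.compile(r"<TR><TD>(?P<name>[^<]*)</TD>"
               r"<TD>(?P<price>[^<]*)</TD></TR>"),
    re.compile(r"<TR><TD>(?P<name>[^<]*)</TD>"
               r"<TD><B>(?P<price>[^<]*)</B></TD></TR>"),
]


def extract_rows(html):
    """Extract item data from every row that matches any one pattern."""
    results = []
    for row in re.findall(r"<TR>.*?</TR>", html, flags=re.S):
        for pattern in PATTERNS:
            match = pattern.fullmatch(row)
            if match:
                results.append(match.groupdict())
                break  # one matching pattern is enough for this row
    return results


SAMPLE = (
    "<TR><TH>PRODUCT NAME</TH><TH>PRICE</TH></TR>"
    "<TR><TD>PC1</TD><TD>100</TD></TR>"
    "<TR><TD>PC2</TD><TD><B>200</B></TD></TR>"
)
rows = extract_rows(SAMPLE)
print(rows)
```

Rows that match none of the patterns, such as the header row here, are simply skipped.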

8. The apparatus of claim 3, wherein:

any template that is for a semi-structured document having
a plurality of partial structures containing mutually
different elements contains text extraction style data for
each of the partial structures; and

the unit (e) extracts the item data so as to prepare the
search result for each of the partial structures.

9. An apparatus for retrieving data through search engines
over open networks, comprising:

(aa) a unit (150) for storing location data about the
location of each search engine, essential input item data
specifying essential input items required by an input form
of each search engine, document structure data about the
structure of each HTML document, used to delimit document
into items to be extracted, attribute data about the
attributes of the items to be extracted, used to
conditionally retrieve the items, and style conversion data
used to convert item presentation styles of a user and item
presentation styles of each HTML document from one into
another;

(bb) a unit (131) for finding, according to the location
data, the location of a search engine that contains all
search items specified in an entered query that consists of
the search items and search conditions;

(cc) a unit (136) for selecting, according to the essential
input item data, search engine to be searched from among the
location found search engines, the search engine of which
the essential input item satisfy the specified search
condition;

(dd) a unit (137) for determining an optimum retrieval
pattern for each of the selected search engines according to
a matrix table and converting the entered query into queries
for the selected search engines accordingly, the matrix
table defining combination between the search items and
search conditions and the items and essential input items of
each search engine;

(ee) a unit (132) for converting, if necessary, item
presentation styles of the queries provided by the unit (dd)
into item presentation styles of the search item in selected
search engines according to the style conversion data;

(ff) a unit (14) for transmitting the queries provided by
the unit (ee) to the found locations and acquiring HTML
documents;

(gg) a unit (138) for extracting item data from the acquired
HTML document serving as a first search result according to
the structure data, selecting the extracted item data, if
necessary, according to the attribute data for the search
condition on the basis of corresponding retrieval pattern
and preparing a second search result; and

(hh) a unit (135) for converting, if necessary, item
presentation styles of the second search result into item
presentation styles of each user according to the style
conversion data.
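
The selection performed by unit (cc) can be pictured as a filter over the found search engines: an engine is kept only if the query supplies every essential input item of its input form. The sketch below is a simplified reading under that assumption; the engine names `page_a` and `page_b`, the field names, and the function `select_engines` are hypothetical.

```python
def select_engines(essential_inputs, query_conditions):
    """Keep engines whose essential input items are all given in the query.

    essential_inputs: dict mapping engine name -> list of required fields
    query_conditions: dict mapping field name -> value entered by the user
    """
    supplied = set(query_conditions)
    return [name for name, required in essential_inputs.items()
            if set(required) <= supplied]


# Assumed essential input item data for two search engines.
ESSENTIAL_INPUTS = {
    "page_a": ["key"],
    "page_b": ["key", "area"],
}

selected = select_engines(ESSENTIAL_INPUTS, {"key": "bookshop"})
print(selected)
```

An engine that is filtered out here would reject or mishandle the query anyway, so dropping it before any request is sent avoids a wasted round trip.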

10. The apparatus of claim 9, further comprising:

(ii) a unit (1345) for storing, for each HTML document, a
template that stipulates at least item name to be extracted
and prescribed text extraction style data of item group to
be extracted according to the document structure data,

wherein the unit (gg) compares the acquired HTML document
with the corresponding template by scanning the acquired
HTML document serving as the first search result; and
extracts item data of the items matching the text extraction
style data of the template so as to prepare the second
search result.

11. The apparatus of claim 10, wherein:

the unit (gg) shapes the search result into a table.

12. The apparatus of claim 10, wherein, if the text
extraction style data of a given template includes link data
to another HTML document:

the unit (gg) scans a linked HTML document and compares the
linked HTML document with the template.

13. The apparatus of claim 10, wherein:

any template that is for an HTML document having a plurality
of partial structures of the same structure contains text
extraction style data for each of the partial structures;
and
the unit (gg) extracts the item data so as to prepare the
search result for each of the partial structures.

14. The apparatus of claim 10, wherein:

the template contains a plurality of pieces of text
extraction style data for each of partial structures, the
text extraction style data being used for filtering uneven
parts contained in the partial structure; and

the unit (gg) extracts item data of the items matching the
text extraction style data, by scanning the acquired HTML
document, when the partial structure of the HTML document
matches any one piece of the text extraction style data.

15. The apparatus of claim 10, wherein:

any template that is for an HTML document having a plurality
of partial structures containing mutually different elements
contains text extraction style data for each of the partial
structures; and

the unit (gg) extracts the item data so as to prepare the
search result for each of the partial structures.

16. An apparatus for extracting data item by item from
arbitrary HTML document over open networks, comprising:

(aaa) a unit (1345) for storing a template for each HTML
document according to document structure data about the
structure of the HTML document used to delimit document into
items to be extracted, the template stipulating at least
item name to be extracted and prescribed text extraction
style data of item group to be extracted from the HTML
document;

(bbb) a unit (1341) for analyzing a template corresponding
to acquired HTML document; and

(ccc) a unit (1343) for comparing the acquired HTML
documents with corresponding template by scanning the
acquired HTML document, and extracting item data of the
items matching the text extraction style data of the
template, so as to prepare a search result.

17. The apparatus of claim 16, wherein:

the unit (ccc) shapes the search result into a 
table. 

18. The apparatus of claim 16, wherein, if the text
extraction style data of a given template includes link data
to another HTML document:

the unit (ccc) scans a linked HTML document and compares the
linked HTML document with the template.

19. The apparatus of claim 16, wherein: 

any template that is for an HTML document 
having a plurality of partial structures of the 
same structure contains text extraction style 
data for each of the partial structures; and 
the unit (ccc) extracts the item data so as to 
prepare the search result for each of the partial 
structures.

20. The apparatus of claim 16, wherein:

the template contains a plurality of pieces of extraction
text style data for each of partial structures, the text
extraction style data being used for filtering uneven parts
contained in the partial structure; and

the unit (ccc) extracts item data of the items matching the
extraction style data, by scanning the acquired HTML
document, when the partial structure of the HTML document
matches any one piece of the extraction text style data.

21. The apparatus of claim 16, wherein:

any template that is for an HTML document having a plurality
of partial structures containing mutually different elements
contains text extraction style data for each of the partial
structures; and

the unit (ccc) extracts the item data so as to prepare the
search result for each of the partial structures.

22. A method of retrieving data contained in a plurality of
semi-structured documents over open networks, comprising the
steps of:

retrieving data scattered among semi-structured documents
for entered query according to meta data about each of the
semi-structured documents and preparing a collective search
result, the meta data including items to be extracted from
the semi-structured documents and item data used to
conditionally retrieve the items; and

outputting the search result in a prescribed single format
that is specific to each user.

23. A method of retrieving data contained in a plurality of
semi-structured documents over open networks, comprising the
steps of:

(a) finding, according to location data that specifies the
location of each of the semi-structured documents, the
location of a semi-structured document that contains all
search items specified in an entered query that consists of
the search items and search conditions (S210);

(b) converting, if necessary, item presentation styles of
the entered query into item presentation styles of the
search item in location found semi-structured documents
according to style conversion data and forming queries for
the location found semi-structured documents, the style
conversion data being used to convert item presentation
styles of a user and item presentation styles of the
semi-structured documents from one into another (S220,
S230);

(c) transmitting the queries provided by the step (b) to the
found locations and acquiring the semi-structured documents
(S240);

(d) extracting item data from the acquired semi-structured
documents according to document structure data, selecting
the extracted item data, if necessary, according to
attribute data for the search condition and preparing a
search result, the document structure data specifying the
structure of each of the semi-structured documents and being
used to delimit document into items to be extracted, the
attribute data specifying the attributes of each item to be
extracted and being used to conditionally retrieve the items
(S240); and

(e) converting, if necessary, item presentation styles of
the search result into the item presentation styles of each
user according to the style conversion data (S250).
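
Steps (d) and (e) rely on converting between the presentation style of a document and that of the user, and on selecting extracted items against a search condition. The sketch below assumes, purely for illustration, that the document presents prices as text such as "¥170,000" while the user works with plain integers; the function names and the sample rows are hypothetical.

```python
def to_number(price_text):
    """Convert a document-style price such as '¥170,000' to an integer."""
    return int(price_text.lstrip("¥").replace(",", ""))


def to_document_style(amount):
    """Convert an integer amount back to the document's price style."""
    return f"¥{amount:,}"


def select_rows(rows, max_price):
    """Keep only the extracted rows whose price satisfies the condition."""
    return [row for row in rows if to_number(row["price"]) <= max_price]


ROWS = [
    {"product name": "Maker A/PC1", "price": "¥170,000"},
    {"product name": "Maker A/PC2", "price": "¥238,000"},
]

cheap = select_rows(ROWS, 200000)
print(cheap, to_document_style(238000))
```

Comparing prices as numbers rather than as text is what makes a condition such as "at most 200,000 yen" meaningful across documents that format prices differently.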

24. A method of retrieving data through search engines over
open networks, comprising the steps of:

(aa) finding, according to location data that specifies the
location of each search engine, the location of a search
engine that contains all search items specified in an
entered query that consists of the search items and search
conditions (S410, S420);

(bb) selecting, according to essential input item data that
specifies essential input items required by an input form of
each search engine, search engine to be searched from among
the location found search engines, the search engine of
which the essential input item satisfy the specified search
condition (S430);

(cc) determining an optimum retrieval pattern for each of
the selected search engines according to a matrix table and
converting the entered query into queries for the selected
search engines accordingly, the matrix table defining
combination between the search items and search conditions
and the items and essential input items of each search
engine (S440);

(dd) converting, if necessary, item presentation styles of
the queries provided by the step (cc) into item presentation
styles of the search item in selected search engines
according to style conversion data that is used to convert
item presentation styles of a user and item presentation
styles of each HTML document from one into another (S450,
S460);

(ee) transmitting the queries obtained by the step (dd) to
the found location and acquiring HTML documents (S470);

(ff) extracting item data from the acquired HTML document
serving as first search result according to document
structure data (S475), selecting, if necessary, the
extracted item data according to attribute data for the
searching condition on the basis of corresponding retrieval
pattern, and preparing a second search result (S480, S490),
the document structure data specifying the structure of each
HTML document and being used to delimit document into items
to be extracted, the attribute data specifying the
attributes of the items to be extracted and being used to
conditionally retrieve the items; and

(gg) converting, if necessary, item presentation styles of
the second search result into item presentation styles of
each user according to the style conversion data (S500).

25. A method of extracting data item by item from arbitrary
HTML document over open networks, comprising the steps of:

(aaa) analyzing a template corresponding to acquired HTML
document, the template for each HTML document being set
according to document structure data that specifies the
structure of each HTML document and is used to delimit
document into items to be extracted, the template
stipulating at least item name to be extracted and
prescribed text extraction style data of item group to be
extracted from the corresponding HTML document (S730); and

(bbb) comparing the acquired HTML documents with
corresponding template by scanning the acquired HTML
document, and extracting item data of the items matching the
text extraction style data of the template, so as to prepare
a search result (S740, S750).


26. A computer readable recording medium recording a program
for causing the computer to execute processing for
retrieving data contained in a plurality of semi-structured
documents over open networks, the processing including:

a process for retrieving the data scattered among
semi-structured documents for entered query according to
meta data about each of the semi-structured documents and
preparing a collective search result, the meta data
including items to be extracted from the semi-structured
documents and item data used to conditionally retrieve the
items; and
a process for outputting the search result in a prescribed
single format that is specific to each user.

27. A computer readable recording medium recording a program
for causing the computer to execute processing for
retrieving data involved in a plurality of semi-structured
documents over open networks, the processing including:

(a) a process (131) for finding, according to location data
that specifies the location of each of the semi-structured
documents, the location of a semi-structured document that
contains all search items specified in an entered query that
consists of the search items and search conditions;

(b) a process (132) for converting, if necessary, item
presentation styles of the entered query into item
presentation styles of the search item in location found
semi-structured documents according to style conversion data
and forming queries for the location found semi-structured
documents, the style conversion data being used to convert
item presentation styles of a user and item presentation
styles of the semi-structured documents from one into
another;

(c) a process (14) for transmitting the queries provided by
the process (b) to the found locations and acquiring the
semi-structured documents;

(d) a process (134) for extracting item data from the
acquired semi-structured documents according to document
structure data, selecting the extracted item data, if
necessary, according to attribute data for the search
condition and preparing a search result, the document
structure data specifying the structure of each of the
semi-structured documents and being used to delimit document
into items to be extracted, the attribute data specifying
the attributes of each item to be extracted and being used
to conditionally retrieve the items; and

(e) a process (135) for converting, if necessary, item
presentation styles of the search result into the item
presentation styles of each user according to the style
conversion data.

28. The recording medium of claim 27, wherein the process
(d)

compares the acquired semi-structured document with
corresponding template, the template stipulating, for each
of the semi-structured documents, at least item name to be
extracted and prescribed text extraction style data of item
group to be extracted according to the document structure
data; and
extracts item data of the items matching the text extraction
template so as to prepare the search result.

29. The recording medium of claim 28, wherein the process
(d) shapes the search result into a table.

30. The recording medium of claim 28, wherein, if the text
extraction style data of a given template includes link data
to another semi-structured document, the process (d) scans a
linked semi-structured document and compares the linked
semi-structured document with the template.

31. The recording medium of claim 28, wherein: 

any template that is for a semi-structured document
having a plurality of partial structures of
the same structure contains text extraction 
style data for each of the partial structures; and 
the process (d) extracts the item data so as to 
prepare the search result for each of the partial 
structures. 

32. The recording medium of claim 28, wherein:

the template contains a plurality of pieces of text
extraction style data for each of partial structures, the
text extraction style data being used for filtering uneven
parts contained in the partial structure; and

the process (d) extracts item data of the items matching the
text extraction style data, by scanning the acquired
semi-structured document, when the partial structure of the
semi-structured document matches any one piece of the
extraction text style data.

33. The recording medium of claim 28, wherein:

any template that is for a semi-structured document having
a plurality of partial structures containing mutually
different elements contains text extraction style data for
each of the partial structures; and

the process (d) extracts the item data so as to prepare the
search result for each of the partial structures.


34. A computer readable recording medium recording a program
for causing the computer to execute processing for
retrieving data through search engines over the open
networks, the processing including:

(aa) a process (131) for finding, according to location data
that specifies the location of each search engine, the
location of a search engine that contains all search items
specified in an entered query that consists of the search
items and search conditions;

(bb) a process (136) for selecting, according to essential
input item data that specifies essential input items
required by an input form of each search engine, search
engine to be searched from among the location found search
engines, the search engine of which the essential input item
satisfy the specified search condition;

(cc) a process (137) for determining an optimum retrieval
pattern for each of the selected search engines according to
a matrix table and converting the entered query into queries
for the selected search engines accordingly, the matrix
table defining combination between the search items and
search conditions and the items and essential input items of
each search engine;

(dd) a process (132) for converting, if necessary, item
presentation styles of the queries provided by the process
(cc) into item presentation styles of the search item in
selected search engines according to style conversion data
that is used to convert item presentation styles of a user
and item presentation styles of each HTML document from one
into another;

(ee) a process (14) for transmitting the queries obtained by
the process (dd) to the found location and acquiring HTML
documents;

(ff) a process (138) for extracting item data from the
acquired HTML document serving as first search result
according to document structure data, selecting, if
necessary, the extracted item data according to attribute
data for the searching condition on the basis of
corresponding retrieval pattern, and preparing a second
search result, the document structure data specifying the
structure of each HTML document and being used to delimit
document into items to be extracted, the attribute data
specifying the attributes of the items to be extracted and
being used to conditionally retrieve the items; and

(gg) a process (135) for converting, if necessary, item
presentation styles of the second search result into item
presentation styles of each user according to the style
conversion data.

35. The recording medium of claim 34, wherein the process
(ff)

compares the acquired HTML document with corresponding
template, the template stipulating, for each of HTML
documents, at least item name to be extracted and prescribed
text extraction style data of item group to be extracted
according to the document structure data; and

extracts item data of the items matching the text extraction
style data of the template so as to prepare the search
result.

36. The recording medium of claim 35, wherein the process
(ff) shapes the search result into a table.

37. The recording medium of claim 35, wherein, if the text
extraction style data of a given template includes link data
to another document, the process (ff) scans a linked HTML
document and compares the linked HTML document with the
template.

38. The recording medium of claim 35, wherein:

any template that is for an HTML document having a plurality
of partial structures of the same structure contains text
extraction style data for each of the partial structures;
and
the process (ff) extracts the item data so as to prepare the
search result for each of the partial structures.

39. The recording medium of claim 35, wherein:

the template contains a plurality of pieces of text
extraction style data for each of partial structures, the
text extraction style data being used for filtering uneven
parts contained in the partial structure; and

the process (ff) extracts item data of the items matching
the text extraction style data, by scanning the acquired
HTML document, when the partial structure of the HTML
document matches any one piece of the extraction text style
data.

40. The recording medium of claim 35, wherein:

any template that is for an HTML document having a plurality
of partial structures containing mutually different elements
contains text extraction style data for each of the partial
structures; and

the process (ff) extracts the item data so as to prepare the
search result for each of the partial structures.

41. A computer readable recording medium recording a program
for causing the computer to execute processing for
extracting data item by item from arbitrary HTML documents
over open networks, the processing including:

(aaa) a process (1341) for analyzing a template
corresponding to acquired HTML document, the template for
each HTML document being set according to document structure
data that specifies the structure of each HTML document and
is used to delimit document into items to be extracted, the
template stipulating at least item name to be extracted and
prescribed text extraction style data of item group to be
extracted from the corresponding HTML document; and

(bbb) a process (1343) for comparing the acquired HTML
documents with the corresponding template by scanning the
acquired HTML document, and extracting item data of the
items matching the text extraction style data of the
template, so as to prepare a search result.

42. The recording medium of claim 41, wherein the process
(bbb) shapes the search result into a table.

43. The recording medium of claim 41, wherein, if the text
extraction style data of a given template includes link data
to another document, the process (bbb) scans a linked HTML
document and compares the linked HTML document with the
template.

44. The recording medium of claim 41, wherein:

any template that is for an HTML document having a plurality
of partial structures of the same structure contains text
extraction style data for each of the partial structures;
and
the process (bbb) extracts the item data so as to prepare
the search result for each of the partial structures.

45. The recording medium of claim 41, wherein:

the template contains a plurality of pieces of text
extraction style data for each of partial structures, the
text extraction style data being used for filtering uneven
parts contained in the partial structure; and

the process (bbb) extracts item data of items matching the
extraction text style data, in partial structures thereof
according to the first and second extraction style data of
corresponding ones of the templates by scanning the obtained
HTML document, when the partial structure of the HTML
document matches any one piece of the extraction text style
data.

46. The recording medium of claim 41, wherein:

any template that is for an HTML document having a plurality
of partial structures containing mutually different elements
contains text extraction style data for each of the partial
structures; and

the process (bbb) extracts the item data so as to prepare
the search result for each of the partial structures.
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FIG.9A 

EXEMPLARY DISPLAY BY WEB BROWSER
TITLE: SHOP A PRODUCT INFORMATION
URL: http://www.shop_a.co.jp/products.html

PRODUCT INFORMATION

PRODUCT NAME     PRICE
Maker A/PC1      ¥170,000
Maker A/PC2      ¥238,000
Maker B/PC101    ¥198,000

FIG.9B

HTML DOCUMENT

<BODY>
<H1>PRODUCT INFORMATION</H1>
<TABLE BORDER>
<TR><TH>PRODUCT NAME</TH><TH>PRICE</TH></TR>
<TR><TD>Maker A/PC1</TD><TD>¥170,000</TD></TR>
<TR><TD>Maker A/PC2</TD><TD>¥238,000</TD></TR>
<TR><TD>Maker B/PC101</TD><TD>¥198,000</TD></TR>
</TABLE>
</BODY>
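
The product table of FIG.9B lends itself to a small worked example of template-based extraction. In the sketch below a template is assumed to consist of the item names and one regular expression describing a data row; this representation, the names `TEMPLATE` and `extract`, and the use of Python's `re` module are choices made for illustration only and are not the template format of the specification.

```python
import re

HTML = """<BODY>
<H1>PRODUCT INFORMATION</H1>
<TABLE BORDER>
<TR><TH>PRODUCT NAME</TH><TH>PRICE</TH></TR>
<TR><TD>Maker A/PC1</TD><TD>¥170,000</TD></TR>
<TR><TD>Maker A/PC2</TD><TD>¥238,000</TD></TR>
<TR><TD>Maker B/PC101</TD><TD>¥198,000</TD></TR>
</TABLE>
</BODY>"""

# Hypothetical template: item names plus a pattern for one data row.
TEMPLATE = {
    "items": ("product name", "price"),
    "pattern": re.compile(r"<TR><TD>(.*?)</TD><TD>(.*?)</TD></TR>"),
}


def extract(html, template):
    """Scan the document and return one record per matching row."""
    return [dict(zip(template["items"], match.groups()))
            for match in template["pattern"].finditer(html)]


rows = extract(HTML, TEMPLATE)
for row in rows:
    print(row["product name"], row["price"])
```

The header row uses TH cells, so it does not match the pattern and only the three product rows are returned, which is already the tabular shape of the search result.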
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FIG.lOA 
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FIG.20 



Page_B HTML

<!DOCTYPE HTML PUBLIC "-//W3C//DTD W3 HTML//EN">
<HTML>
<HEAD>
<TITLE>page_B</TITLE>
</HEAD>
<BODY>
<FORM action=http://www.page_b.co.jp/cgi-bin/search.cgi method=GET>
SHOPS TO FIND<BR><BR>
<TABLE>
<TH>
SHOP NAME
</TH>
<TH>
AREA
</TH>
</TR>
<TR>
<TD>
<INPUT name=key size=30>
</TD>
<SELECT name=area>
<OPTION value=00 SELECTED>
<OPTION value=01>YOKOSUKA-SHI
<OPTION value=02>FUJISAWA-SHI
<OPTION value=03>HIRATSUKA-SHI
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<INPUT type=reset value=CLEAR> 

</FORM> 

</BODY> 
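
Because the form of Page_B uses method GET, a query for this search engine amounts to a URL with the field values appended as a query string. The sketch below builds such a URL with Python's standard `urllib.parse.urlencode`; the field names `key` and `area` and the action URL are taken from the form above, while the function name and the sample values are assumptions.

```python
from urllib.parse import urlencode

# Action URL of the Page_B form.
ACTION = "http://www.page_b.co.jp/cgi-bin/search.cgi"


def build_query_url(shop_name, area_code):
    """Build the GET request URL that submitting the form would produce."""
    params = {"key": shop_name, "area": area_code}
    return f"{ACTION}?{urlencode(params)}"


# "01" is the option value shown for YOKOSUKA-SHI in the form.
url = build_query_url("book shop", "01")
print(url)
```

`urlencode` percent-encodes the values and joins them with `&`, so the space in the shop name is sent as `+`.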
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